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1 Introduction 



Multi-touch technology has recently emerged as one of the most sought after input 
technologies. The presentation of the iPhone and the Microsoft Surface, both in 2007, 
have received considerable attention and touch-sensing technology has since then become 
a de facto standard in consumer electronics. Moreover the presentation of a low-cost 
multi-touch prototype by Han in 2005 based on the principle of frustrated total internal 
reflection has made the required sensing technology available to a large community that 
is continuously driving the development of new multi-touch applications. 

In the course of time multi-touch researchers and developers have come up with an 
ever growing number of gestures used in multi-touch devices, however they generally 
have a common limitation. Most devices assume that all simultaneous touches belong 
to the same gesture, hence support only a single user. While this is without question 
true for small form factor devices like smartphones, things aren't as straightforward 
for large touch screens that can accommodate more than one user. One approach, as 
used in the Microsoft Surface, is to separate multiple simultaneous touches based on the 
touched object and to evaluate gestures for each object individually. Another approach 
is to define public and private areas on the touch-sensitive surface granting each user a 
private workspace. Albeit the aptitude of these approaches they are constraint by the 
data provided from the sensing technology. 

Technologies based on resistive and acoustic approaches are by design limited to sens- 
ing direct surface contact only, while capacitive technology allows to a certain extend 
also the sensing of objects in proximity. Apart from the MERL Diamond Touch(^ which 
can uniquely identify the interacting user for each touch point, these technologies reveal 
solely the position of a surface contact and possibly its size and shape. In contrast 
technologies based on optical sensing are able to provide a much richer representation 
of what happens on and beyond the touch-sensitive surface. Although this entails a 
significantly higher processing performance due to the increased complexity, it opens 
the possibility of providing contextual information for each surface contact. Hence one 
might be able to define whether simultaneous touch points originate from the same or 
different hands. Further information might include whether a finger touch stems from 
a left or right hand and might even help to identify users given the orientation of the 
hand. 

Yet surprisingly little research exists on how to extract these properties from the cam- 
era image. On the one hand processing in most prototypes is limited to a rather simple 
surface contact detection algorithm ignoring all these aforementioned factors. On the 
other hand the existing approaches all try to establish spatial relationships solely from 

^see pagem 
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1 Introduction 

the previously revealed contact position and shape. However no integrated processing 
pipeline exists that combines those features and reveals surface touches together with 
their spatial relationships. 

This thesis proposes a novel processing pipeline for optical multi-touch surfaces that 
fully incorporates these features. Based on the Maximally Stable Extremal Regions algo- 
rithm that has previously been mainly used in stereo image matching and visual tracking, 
processing steps will be described that by design reveal surface contacts as well as their 
spatial relationships. Taking advantage of this structure finger touches can not only 
be attributed to different hands but in presence of all five fingers the algorithm even 
performs a hand and finger registration. 

However before further elaborating the processing pipeline, the next chapter will first 
present a survey of the different sensing technologies and detail the most common pro- 
cessing steps for prototypes based on optical sensing. Afterwards the multi-touch proto- 
type and the novel processing pipeline will be thoroughly described in chapter [3j Chapter 
|4] follows with an evaluation of both performance and accuracy of the proposed algo- 
rithms, while in chapter |5] a conclusion and an outlook on potential future work will be 
given. 
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2 Existing Multi-Touch Technologies 



Although multi-touch devices only recently caught the attention of a wider public with 
the proliferation of touch sensing technologies in smartphones and tablet computers, it 
actually has a very long history ranging back until the early 1980s. Bill Buxton, him- 
self an important contributor to the multi-touch research, names the Flexible Machine 
Interface [12] from 1982 as the first multi-touch system in his much-cited review on 
touch-sensitive devices [8]. The prototype consisted of a panel of frosted glass on which 
the finger's shadows appeared as black spots with their size increasing the more the user 
touched the panel. It is the earliest prototype where a camera mounted on the underside 
of the tabletop was used to capture the surface. Further examples of multi-touch devices 
from the early stages, this time based on capacitive sensing, have been presented by Bob 



Boie in 1984 [7] and one year later by Lee et al. [lO 



Since these early prototypes many other devices have been developed that rely on a 
wide range of different sensing technologies. Hence, this chapter will focus in a first 
section on that variety of sensing technologies, including widely known techniques based 
on capacitive and optical sensing as well as lesser known ones based on acoustic and 
thermal sensing. In a second section the commonly used processing steps required to 
reveal finger touches in a diffuse back-illumination setup, the same as the one used in 
the prototype presented in the next chapter, will be described. This pipeline will be 
considered the de facto standard when introducing the novel processing pipeline. 



2.1 Sensing Technologies 

All existing sensing technologies have a number of advantages and disadvantages when it 
comes to sensing finger touch. The following classification of technologies loosely follows 
the one proposed in ^56j, however adds missing technologies and examples wherever 
required. This section will outline the functioning of resistive, capacitive, acoustic and 
optical sensing and present relevant prototypes. 



2.1.1 Resistive Sensing 

Resistive touch screens are typically composed of two layers with uniform resistive coat- 
ings of indium tin oxide on their inner sides. The lower layer is typically made of glass 
or acrylic while the upper layer consists of a flexible hard-coated membrane. The two 
panels are separated by an insulating layer, usually small invisible spacer dots, so that in 
idle state the conductive layers are not connected. However when pressed by a arbitrary 



object the conductive layers connect and a current is established (see figure 2.1). In 
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2 Existing Multi-Touch Technologies 




Figure 2.1: Design of a surface based on resistive technology. 

order to measure the actual touch position two consecutive electric measurements along 
the horizontal and vertical axis are performed. Measurements are conducted based on 
the two following wire configurations. 

4- wire configuration The 4-wire configuration is the easiest and most common con- 
figuration where the inner and outer layer are used to establish the touch position. The 
resistive coating of the upper layer usually consists of wires running vertically, while 
the coating of the lower layer consists of wires running horizontally. First a voltage is 
applied to the upper layer forming a uniform voltage gradient along the wires. When 
pressed the layers connect and the voltage at the touch point can be measured by the 
horizontal wires of the lower layer. The horizontal touch position is similarly established 
by inverting the process and applying a voltage to the lower layer and measuring on the 
upper layer. 

An important drawback of this configuration is it's reliance for touch measurements 
on consistent electric characteristics of both resistive coatings. However the fiexibility 
of the upper layer will result in microscopic cracks in its coating changing the electric 
uniformity along the vertical axis. Therefore resistive touch screens exhibit a certain 
inaccuracy along one axis over time. 

5- wire configuration The 5-wire configuration solves the shortcoming of diminished 
accuracy by including both vertical and horizontal wires in the coating of the lower and 
more stable layer while the upper layer's coating now consists of a single additional wire 
(therefore the name 5-wire). Similarly to the 4-wire configuration voltage is first applied 
vertically and then horizontally. The touch position is then determined by the voltage 
measured using the wire on the upper layer. Applying voltage only to wires on the 
lower layer ensures consistent and uniform electric characteristics over time. Therefore 
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2.1 Sensing Technologies 



extensive pressure on the flexible upper layer no more degrades touch detection any 
more. 



2.1.2 Capacitive Sensing 

Capacitive sensing relies on the concept of electric capacity occurring between any two 
conductive elements placed in proximity. The resulting capacity is determined by the 
dielectric (the layer filling the gap between the conductors), the surface of the conduc- 
tors and their distance from each other. Its usage for finger touch sensing is enabled 
thanks to the conductive property of fingers and human skin in general as they contains 
conductive electrolytes. Capacitive sensing technology is based on two basic principles: 
self-capacitance and mutual capacitance. 



Self-capacitance Considering two conductive materials separated by an insulator. If 
a charge is applied to one of the two materials, an electric field is created on its surface 
and capacitive coupling occurs resulting in a charge transfer. The amount of charge 
transferred depends on the distance between the two materials and the insulator. 



Mutual capacitance Given two conductive materials that form a capacitor as de- 
scribed in self- capacitance. Their electric field is mainly concentrated between the 
conductive materials, however fringe fields exist outside this region. Consider such a 
capacitor and another conductive material placed in its proximity only separated by 
an insulator. Capacitive coupling occurs due to the fringe fields resulting in a charge 
transfer and a change in electric characteristics of the capacitor. 



Surface capacitive sensing 

Surface capacitance based sensing uses a panel covered by a uniform conductive coating 
on one of its sides. On all four corners of the conductive coating a charge is applied 
resulting in a uniform spread of electrons across the surface. Exploiting the principle of 
self-capacitance a finger approaching the non-coated side of the panel forms a capacitor 
with the conductive coating at the closest point on the surface. Charge is transferred 
between the coating and the finger resulting in current on the surface being drawn 



from the corners (see figure 2.2). Using sensors placed at all four corners, the amount of 
transferred current is measured. As this amount is inversely proportional to the distance 
from the touch position, the exact position can be established using bilinear filtering. 

Although the technology can be made highly durable by using a durable panel as 
surface, it suffers from the drawback of being single-touch only. Furthermore its accuracy 
can be affected by other conductive materials near the touch surface. 



Projected capacitive sensing 

Unlike surface capacitance based sensing, devices based on projected capacitive sensing 
consist of a grid of spatially separated electrodes patterned on either one or two panels |1] . 
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2 Existing Multi-Touch Technologies 




Figure 2.2: Principle of operation of a surface based on surface capacitance sensing. 



This grid is usually composed of electrodes in rows and columns and can be operated 
using the principle of self-capacitance or mutual capacitance. In two-layered devices 
based on self-capacitance a charge is sequentially applied to all rows and columns and 
their capacitance is measured. However this defines touch points only as a set of points on 
the X- and Y-axis without revealing their exact correspondence. This correspondence 
is obvious for single touch, however is undefined when sensing multiple touch points 
resulting in so-called ghost points. 

A far more common, yet more complex approach is to use electrodes patterned into a 
single layer measuring electrode intersections instead of rows and columns of electrodes 
only. Each intersection of electrodes forms a small capacitor when a charge is applied 
one-by-one to all electrodes in columns. This capacity is measured on each electrode in 
rows and remains constant as long as no conductive material is placed in its proximity. 
However given the principle of mutual capacitance a finger approaching the touch surface 
causes a loss of capacitance on electrode intersections in the touched surface region. 
Although the grid of electrodes might seem too large on first glance to create accurate 
touch measurements, one is reminded that the presence and touch of a finger affects 
a large number of intersections to an extend proportionally inverse to their distance. 
Given this set of capacity measurements the actual touch position can be accurately 
determined using bicubic interpolation. 

Rekimoto used this technique to create a large scale touchable surface of 80x90cm 
^T\ . Although he only used an electrode grid of 8x9 cells, hence a cell size of lOxlOcm, 
he achieved a touch accuracy of 1 cm. 

With the advent of multi-touch technology in everyday life, projected capacitive sens- 
ing is now used on most small-scale devices such as the iPhone [2^ . 
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2.1 Sensing Technologies 

MERL DiamondTouch 

DiamondTouch is a touch-sensitive input device developed at Mitsubishi Electric Re- 
search Laboratories (MERL) based on capacitive sensing whose capabilities however 
go beyond the common finger touch detection [13]. Apart from accurately detecting 
multiple touch points it reliably identifies for each touch point the corresponding user. 
Furthermore operation is unaffected by objects placed on or near the device regardless 
of them being conductive or not. 

The device operates using an array of antennas placed below a protective surface acting 
as an insulator to objects above the device. Each antenna is driven with a uniquely 
identifiable signal and shielded from neighboring antennas to avoid signal interference. 
Unlike other devices the electric capacity resulting from conductive objects approaching 
the surface is not measured inside the device itself but is used to transfer the unique 
signal of an antenna to this object. Once the user touches the surface a capacitively 
coupled circuit running from the transmitter (the touched antenna) to a receiver attached 
to the user's chair is completed. 

Given the presence of current at the receiver a multitude of signals coming from dif- 
ferent antennas must be separated by the system. Therefore code-division-multiplexing 
is used to make the antenna signals mutually orthogonal. Nonetheless a received sig- 
nal only indicates the antenna being touched, but not exactly where on the antenna. 
Therefore the antenna size in the proposed design covers only an area of 0.5x0. 5mm, an 
area smaller than the usual finger contact area. Whenever the finger contact area spans 
several antennas at once an even finer resolution might be achieved by interpolating 
between the antennas given their relative signal strength. 

2.1.3 Acoustic Sensing 

Acoustic sensing did only have limited impact in multi-touch sensing technology but 
exhibits some interesting properties and is therefore included here. Recently acoustic 
sensing has been used to complement traditional sensing technologies and to provide 
information on how the surface is being touched [25]. 

Location Template Matching 

Location Template Matching (LTM) is based on the principle of Time Reversal. Time 
Reversal relies on the reciprocity of the wave equation and states that given a recorded 
acoustic signal the source signal at its original location can be reconstructed by focusing 
it back in time (by using negative time) [50l [18] . Therefore the recorded acoustic signal 
itself must be unique as well. Given a single sensor attached to an arbitrary surface, 
LTM exploits this principle by prerecording a certain number of touches associated with 
their location. When a user now touches the surface a cross-correlation is applied to the 
recorded signal to find its best match among the prerecorded ones. Although its main 
advantage being its aptitude to be used with irregular surfaces (see [9]), touch detection 
is constraint to a finite number of contact points. 
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2 Existing Multi-Touch Technologies 
Time Delay of Arrival 

Whenever the user touches a surface acoustic vibrations are created that propagate along 
the surface. The Time Delay of Arrival (TDOA) approach measures the incidence of 
these vibrations at a number of sensor locations. Given the relative delay between distant 
sensor locations the actual touch location can be inferred (see [50] for more details). 
The obvious setup would be a flat surface with four sensors attached in rectangular 
shape, however it could be equally applied to curved surfaces. Nonetheless estimating 
the propagation of vibrations along a curved surface makes calculations much more 
computationally demanding. 

The main advantage of TDOA is its independence from a specific surface material. 
Almost all materials such as wood, glass, plastic or metal can be used as long as they 
transmit acoustic vibrations [9j. On the other hand TDOA is inherently single-touch 
only and furthermore only detects the incidence of a touch but is ignorant about the 
objects state afterwards, whether it is gone or still remains on the surfaces. 



Surface Acoustic Wave Absorption 

The earliest touch surface based on Surface Acoustic Wave (SAW) absorption ranges 
back as early as 1972 |3lj. The surface panel is equipped on all of its sides with an 
array of transducers, one array of transducers acting as transmitter while the array 
opposite acts as receiver. Transmitted surface waves are absorbed by soft objects such 
as fingers attenuating the measured signal at the receiver. Transducers need to cover the 
complete contour of the surface to achieve reasonable detection resolution. To reduce 
the required number of transducers Adler and Desmares proposed the use of an reflective 



array [T]. In this case transducers are placed at the corners of the surface with two of 
them acting as transmitter and sending a burst of waves along the edges of the surface. 
These waves traverse an array of partial mirrors each of them reflecting part of the wave 
vertically inside the surface panel which is then again reflected by a second array of 



partial mirrors on the opposite side to the receiver (see figure 2.3). An initial pulsed 
signal therefore results in a continuous response at the receiver due to differing lengths 
of wave paths. Hence temporal separation determines the response of each single wave 
path. User interaction with the surface therefore results in a signal attenuation at the 
corresponding point in time. 

As SAW attenuation is proportional to the size of the contact area, one can fur- 
thermore infer the amount of pressure applied to a certain contact point allowing to 
distinguish different types of touch. 



2.1.4 Optical Sensing 

Optical sensing has been widely used for touch-sensitive devices because of the richness 
of data that it provides. Particularly it does not limit the number of touch points and 
can sense both objects that are in direct contact with the surface and those that only 
hover above it. However it is exactly the complexity of this data that is sometimes 
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Figure 2.3: Design of a surface based on surface acoustic wave absorption. 



thwarting the intended development of real-time applications. 

Camera-based setups using near-infrared light are among the most popular approaches 
to multi-touch sensing. This is mainly due to the fact that infrared light is invisible to 
the human eye as well as the increasing availability of cheap cameras that provide high 
resolutions and perform at frame rates suitable for real-time applications. Furthermore 
one does not even require a particular infrared camera as digital cameras are sensible to 
near-infrared light which is generally filtered to avoid noise though. However removing 
the IR-cut-filter and adding a IR-band-pass-filter in front of the camera can turn any 
off-the-shelf camera into a fully functional infrared sensing device. 

In the following section most of the described approaches rely on cameras for image 
acquisition hence require a significant throw-distance from the surface. As this might not 
be viable or desirable for all application scenarios, intrinsically integrated devices have 
emerged recently that directly integrate sensing technology into the surface. Finally 
an approach based on thermal sensing is described that might not replace the use of 
near-infrared light for multi-touch sensing but rather complement it given its several 
desirable sensing properties. 



FTIR 

The rediscovery of frustrated total internal reflection (FTIR) by Han and its application 



in multi-touch sensing resulted in increased interest in optical sensing technologies due to 
its simple, inexpensive and scalable design p3]. FTIR relies on the optical phenomenon 
of total internal reflection that can occur when a ray of light strikes a boundary with 
a medium of lower refractive index. The amount of light being refracted depends on 
its angle of incidence with more light being reflected back into its medium of origin the 
more acute the angle. If the angle of incidence is beyond a critical angle all of the light 
is reflected resulting in total internal reflection. However once the initial assumption of 
a lower refractive index is violated by a different medium, it frustrates the total internal 
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Figure 2.4: Design of a surface based on frustrated total internal reflection. 



reflection and light escapes at the boundary. 

This principle is applied to multi-touch devices using a sheet of acrylic or a similar 
material with high optical transmittance. Along the edges arrays of infrared LEDs are 
located that inject light into the panel. Due to the parallel orientation of the LEDs to 
the panel's surface most infrared light is internally reflected. However when an object 
with higher refractive index such as a finger touches the surface light escapes from the 
area of contact. This light is then reflected from the finger and can be captured by an 



infrared-sensitive camera placed behind the surface (see figure 2.4). 

The advantage of this approach is its ability to only identify objects that actually 
touch the surface. However it is at the same time its main drawback as contact areas 
lack any contextual information. Contact areas identical in shape could be caused by the 
same or completely different objects. Contact areas in proximity could be caused by the 
same or again completely different objects. This missing contextual information makes 
identification and classification, such as association of a finger to a hand and ultimately 
a user, extremely difficult. 

As the setup uses a camera to capture the surface a significant amount of space is 
required behind the surface to accommodate the camera. The larger the surface the 
further the camera needs to be located behind it which might not be feasible in all 
application scenarios. This could be overcome using two or more cameras that each 
only capture part of the surface or by using wide-angle cameras. However the former 
would require further synchronization and calibration procedures making the setup more 
complex while the latter would reduce the actual resolution at off-axis points due to 



10 



2.1 Sensing Technologies 



iTTnr-r-nr-r 



IR Illuminator 



Camera 



Figure 2.5: Design of a surface based on diffuse back illumination. 



barrel distortion. 

In [27] a prototype is described that allows incorporation of FTIR technology in com- 
mon LC-screens hence enabling very thin form factors. This prototype is in its setup 



inspired by the work on ThinSight (see page 14 ) and uses as well a large array of infrared 
sensors mounted behind the LC-matrix to capture incoming light. Acrylic fitted with a 
frame of LEDs is placed in front of the LC-screen. They claim to achieve update rates 
of at least 200 Hz with a 1mm resolution however make assumptions on touch pressure 
and finger orientation in order to trigger a touch event. 



Diffuse Illumination 

Diffuse Illumination is a simple design of multi-touch sensing technology requiring solely 
a diffusing surface, a infrared-sensitive video camera and an infrared illuminator. One 
distinguishes two types of diffuse illumination: 

Back Illumination Back illumination describes a setup where both the camera and the 
illuminator are located behind the surface (see figure 2.5). Due to the diffusing 



capabilities of the surface, the infrared light emitted from the illuminator creates 
a well lit area in front of the surface illuminating any object in its proximity. The 
video camera subsequently captures these objects which can then be identified 
using common image processing techniques. 

The inherent simplicity of this approach which can be constructed at low-cost with 
no required engineering skills makes it a widespread alternative to other more so- 
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phisticated technologies. Furthermore this setup can be scaled up to almost any 
size allowing a variety of different application scenarios. However this simplic- 
ity comes with a number of shortcomings. The ability to sense objects near the 
surface, therefore providing contextual information, actually complicates differen- 
tiating touching from hovering the surface. Although objects touching the surface 
should theoretically reflect much more light to the camera, hence providing a 
strong criteria for differentiation, yet it is in practice almost impossible to achieve 
a uniform illumination of the surface. Moreover external infrared light sources and 
stray sun light might interfere with object detection. 

A well known commercial application of this technology is the Microsoft Surfac^ 
(Surface 1) which uses in total 4 cameras to capture the surface. Achieving a 
net resolution of 1280x960 they can detect arbitrary objects placed on the surface 
based on their shape as well as small marker patches. The Surface 1 has now 
been replaced by the Surface 2 which follows an intrinsically integrated approach 



to multi-touch sensing (see page 14) 



Front Illumination Unlike back illumination, this approach places the infrared illumi- 



nator on the opposite side of the video camera (see figure 2.6). Hence the closer an 
object is located to the surface the less light is reflected towards the camera. This 
approach minimizes the interference of external infrared light with object detection 
however might suffer from occlusion effects. Another finger, the hand or even the 
arm might obscure the contact area of a touching finger if placed too close to each 
other therefore largely reducing the contrast in the resulting camera image. Due 
to these shortcomings back illumination is often preferred to front illumination. 

Diffused Surface Illumination 

Diffused Surface Illumination describes a variation of diffused illumination that over- 
comes the shortcoming of uneven illumination |5l] . As for FTIR the setup consists of a 
sheet of acrylic contained in a frame of infrared LEDs, however this time a special type 
of acrylic called EndLighten is being used. This acrylic can be thought of as being filled 
with numerous reflective particles evenly spread across the sheet. When light enters the 
acrylic, it is reflected by these particles evenly to both sides of the sheet resulting in 



uniform illumination of objects in proximity (see figure 2.7). However as light is reflected 
towards the camera as well the captured image exhibits less contrast compared to the 
regular approach. Given the reduced contrast the setup is now much more sensitive to 
external light interference. 

Infrared Light Plane 

The idea behind this approach is to use a setup similar to diffused illumination but 
to only illuminate the space just above the surface. A finger approaching the surface 
and therefore entering the light plane reflects light to the camera mounted behind the 



^see http://www.surface.com 
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Camera 



Figure 2.6: Design of a surface based on diffuse front illumination. 

surface. The light plane can be spanned either by mounting infrared LEDs around the 
surface area or by placing IR lasers with line lenses at all four corners. To a certain 
extend this technology might suffer from occlusion effects as objects blocking a light 
beam reduce the amount of IR light reflected by the occluded object. 

The description of a prototype build using this technology on top of a LCD-screen 
can be found in 

Infrared Light Beam Grid 

Similarly to other grid based approaches, this technology spans a grid of infrared light 
beams across a surface identifying surface contact by measuring occlusion of light beams. 
Displays of any size can be fitted with that technology by mounting a frame of infrared 
transmitter and receivers on top of the display. In this frame transmitters and receivers 
are arranged just above the surface on opposite sides horizontally and vertically. Each 
light beam is tested for occlusion by sequentially pulsing infrared transmitters and simul- 
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Figure 2.7: Design of a surface based on diffused surface illumination. 



taneously measuring incoming light by the receivers on the opposite side. If an object 
or a finger approaches the surface it obstructs the hght path and reveals its presence. 
Given the horizontal and vertical intersections the touch coordinates and an axis-aligned 
bounding box of the contact area can be determined. However similarly to other grid ap- 
proaches this technology might suffer from occlusion effects with multiple, simultaneous 
touch points. 



Intrinsically Integrated Sensing 

In [26] the concept of intrinsically integrated sensing is introduced for technologies that 
integrate sensing capabilities directly into the display and therefore distribute sensing 
capabilities across the surface. While previous camera-based approaches relied on a 
perspective projection of objects, intrinsically integrated sensing can be thought of as 
an orthogonal projection making detection and recognition of objects much more reliable 
regardless of their position. 

In this context an important finding is that LEDs might not only act as light emitter 
but under certain circumstances also as light receiver which goes back to F. Mims in 1973 
IS]. Although not being widely adapted for a long time, [H] presents first applications 
and in pT] it is introduced as sensing technology of a touch-sensitive device prototype. 
Han later presents a similar sensor for proximity sensing [22l [2^ . while in 2010 Echtler 



et al. report to have fitted such a LED array in a 17 LCD-screen [T7] . 



Hodges et al. describe a prototype of intrinsically integrated sensing called ThinSight 
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|26j that is based on an array of retro-reflective optosensors. These components consist 
of both a hght emitter and receiver that are optically shielded from each other. They al- 
low emitting and sensing infrared light simultaneously resulting in a gray level intensity 
image of reflected infrared light above the surface. Similarly to diffuse illumination they 
propose two modes of operation named reflective mode and shadow mode that loosely 
correspond to diffuse back illumination and diffuse front illumination respectively. Re- 
flective mode relies on the emitters to illuminate the area above the surface tracking 
bright regions while shadow mode relies on ambient infrared light for illumination track- 
ing shadows of objects in proximity. Furthermore ThinSight allows novel interaction 
techniques with any simultaneous number of devices that allow communication using 
modulated infrared light such as smartphones or pointing sticks. 

The only commercial application of this technique seems to be the most recent version 
of the Microsoft Surfac^ (Surface 2 or Samsung SUR40). Its main characteristic is its 
very slim form factor that they attribute to a technology called PixelSense |32i il2j which 
is in fact very similar to the described ThinSight above. 



Depth sensing 

The use of two cameras for touch sensing might at first glance seem like an unneces- 
sary overhead due to increased computational demand but several interesting and yet 
performant applications have been developed. 

Touch Light TouchLight [63] is an interactive wall prototype that uses a stereoscopic 
setup combined with a back-projected, transparent surface (DNP HoloScreen). 
While previous setups employed diffuse surfaces that explicitly limit the camera 
to sense solely objects in proximity of the surface, the transparent surface does 
not place any restrictions on sensing. Moreover without diffusion object sensing is 
ultimately more accurate enabling interaction techniques like scanning documents 
placed on the surface or identifying the current user by means of face recognition. 
Instead of focusing just on pointing interaction one is more interested in creating 
a richer representation of objects on the surface describing their size and shape 
using a touch image. 

Given the binocular disparity resulting from the different camera viewpoints ob- 
jects on the projection surface can be accurately extracted. Using simple image 
processing steps such as applying transformations based on homography matrices 
and computing edge filters the touch image for the projection surface can be com- 
puted (see [63] for a detailed description). Furthermore the interaction plane is 
not restricted to lie exactly on the projection surface but could be represented by 
any plane in space enabling what they call Minority Report interfaces. 

PlayAnywhere PlayAnywhere [61] is a compact and portable prototype consisting of a 
projector, a camera and an infrared illuminant that allows to transform any flat 
surface into an interactive area. Sensing of touch is achieved by measuring the 



see http: //www. surf ace . com 
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disparity between a fingertip and its shadow on the surface. As the illuminant 
is placed off-axis from the camera the position of a finger hovering the surface 
differs from its corresponding shadow position while on touch these positions co- 
incide. Furthermore as the distance between illuminant and camera is fixed a 
three-dimensional position of the fingertip might be computed. However this tech- 
nique may suffer from occlusion and in the current implementation applies certain 
restrictions such as supporting only one finger per hand. 

Additionally another approach is described to allow basic interaction techniques 
such as panning, rotation and scaling of the viewport without actually identifying 
a users hand or finger. Given the direct sight of the camera the optical fiow field 
can be extracted from a sequence of camera images from which the corresponding 
transform matrix consisting of scaling, rotation and translation can be derived. 

Kinect Since the Microsoft Kinect, a low-cost yet reliable off-the-shelf alternative to 
much more expensive depth-sensing systems, has become widely available much 
research has focused on this system. In [65] a touch sensing system using a 
Kinect mounted above the surface is described. Although touch sensing might 
seem straightforward with such a system it turned out to be actually impossible 
without previous assumptions. Fingers differ in thickness between each other and 
between people making it hard to set a fixed distance threshold indicating whether 
the finger touched the surface. Furthermore if the finger is oriented at a steep an- 
gle with respect to the surface the finger might obstruct the direct line of sight 
between camera and its fingertip. 

Assuming a direct line of sight between the camera and fingertips pixels are clas- 
sified as touching if they are within a predefined distance from the prerecorded 
background. Although this technique does not match the sensing accuracy of pre- 
viously described approaches it easily allows sensing touch on arbitrarily shaped 
surfaces. Furthermore the shape information of objects above the surface enables 
grouping touch points to hands and ultimately a user. 

Thermal Sensing 

Thermal sensing has so far not been widely explored in the field of touch-sensing de- 
vices (only EnhancedDesk2 in [37] and Thermo Tab let in [33] are known to the author). 
Unlike previous approaches that operate in the near-infrared spectrum, thermal sensing 
measures the amount of far-infrared light emitted by objects that is related to their 
temperature. As presented in [38] thermal sensing has several desirable properties that 
can complement existing approaches of touch-sensing. An obvious advantage is its in- 
dependence of lighting conditions as it works equally well with sun light as with indoor 
lighting or darkness. Therefore it can be used to reliably segment hands from the back- 
ground. A further very interesting property is the heat transfer that takes place when a 
finger touches a surface. Not only is the amount of heat at the contact area a measure of 
touch pressure but also remains at its position for a certain time after the finger has been 
lifted. When dragging a finger across the surface heat traces are left behind indicating a 
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finger's movement trajectories therefore simplifying the differentiation between touching 
and hovering and the classification of even complex gestures. 

The described prototype confirms the robustness of the the previously described fea- 
tures and reveals some more advantages. Most notably a user's hand seems to have a 
unique thermal signature that remains constant over the period of interaction. Therefore 
users can be identified again if they have left the area visible to the camera. Furthermore 
this thermal signature applies as well to objects of different materials that are placed on 
the surface allowing the differentiation of materials. 



2.2 Processing Pipeline 

This section will detail the required processing steps to reveal contact points on a surface 
based on a diffuse back-illumination approach. This processing pipeline will be consid- 
ered the de facto standard and serves as a basis for comparison with the novel processing 
pipeline presented in the next chapter. 

Although the diffuse back-illumination approach is widely used and documented in 
research as well as in the wider multi-touch community, surprisingly little can be found 
about the actual image processing steps in the literature. While a simple set of image 
processing operation appears to be the default approach, it requires certain assumptions 
that may not be valid in all circumstances. The following processing pipeline details 
these essential steps which have been extracted from prototypes described in literature 
and approaches used in the open-source multi-touch community, namely the Community 
Core Vision frameworlsB 



2.2.1 Definitions 

In order to establish a common understanding of the image processing operations in this 
as well as in the following chapter, a number of basic definitions are being presented 
here. These definitions mainly cover notation and simple concepts related to images and 
image areas. 

Blob 

A blob describes a region of an image that is being considered to originate from an 
object that is currently touching the surface. However a blob does not imply any 
sort of semantic with respect to the object being in contact with the surface. The 
terms blob and contact point are both being used interchangeably throughout the 
thesis. 

Image 

An Image / is a mapping of two-dimensional coordinates to an intensity value: 
I : Ci ^ V with Ci (Zl?. Intensity values are generally encoded as 8-bit values, 
therefore V is defined as {0, . . . , 255}. If = A/ x Bi the size of an image I is 
defined as \Ai\ * \Bi\. 



see http : //ccv . nuigroup . com/ 
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Pixel 

A pixel p = [x, y) is the smallest entity of an image /: (x, y) e Cj. It is represented 
by its coordinates {x,y) and its intensity value I{x,y). 

Neighborhood 

For each pixel p in / a neighborhood relation N[p) is defined representing its 
neighboring pixel. Most common in image processing are 4- and 8-neighborhood 
relations. 

4-Neighborhood 

Given an image / with pixels arranged in an orthogonal grid, then the 4- 
neighborhood of a pixel p is defined as follows: 

N{p) = N{x,y) = {(m, n) \ {\ m - x \ + \ n - y \= 1) A {m,n) E Cj} (2.1) 
8-Neighborhood 

Given an image / with pixels arranged in an orthogonal grid, then the 8- 
neighborhood of a pixel p is defined as follows: 

N(p) — N{x, y) — {{m, n) \ (max(| m — x\,\n — y\) — 1)A (m, n) e Cj} 

(2.2) 

Other neighborhood relations exist, for instance in hexagonal grids, however these 
are outside the scope of this thesis. 

Path 

A path between two pixels a and b in an image / exists if and only if there exists 
a set of pixels Pa,b = {a, Pi, ■ ■ ■ ,Pn, b} C Ci such that 

pi e N{a) A b e N{pn) A Vi e {1, . . . , n - 1} I pi+i e N{pi) (2.3) 

Region 

A region i? is a connected set of pixels from an image / such that: 

RCCiA (Va, b e R 3Pa,i, \ Pa,b Q R) (2.4) 

Inner Region Boundary 

The inner boundary ur of a region R is the set of pixels inside the region connected 
to at least one pixel outside the region: 

ujR = {p\peRA {{N{p) nCi\R)^ 0)} (2.5) 

Outer Region Boundary 

The outer boundary Q,r of a region R is the set of pixels outside the region con- 
nected to at least one pixel inside the region: 

^R = {p\piRA {N{p)nR^ 0)} (2.6) 
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2.2.2 Blob Detection 

The first part of the processing pipehne, namely blob detection, reveals all objects that 
are in contact with the multi-touch surface. Some approaches try to limit the number 
of contact points by removing all image information originating from objects that are 
either too large or too small to be considered an intentional touch from a user's finger. 
Nonetheless the basic steps of this part of the pipeline are as follows: 

• Background Subtraction 

• Image Filtering 

• Thresholding 

• Connected Component Analysis 



Background Subtraction 

As for all setups based on infrared illumination, the detection accuracy can be negatively 
affected by ambient infrared light such as stray sun light. Hence the first step is to remove 
interfering light from the image using background subtraction. This operation subtracts 
a previously defined reference image pixel by pixel from a newly acquired camera image. 

The reference image can be acquired in two ways. Firstly a prerecorded image of the 
background could be used as a static reference image. Secondly processing could initiate 
with a prerecorded image and then gradually adjust the reference image to changes in the 
background. The former approach is suitable to setups where the ambient illumination 
can be controlled which generally applies to indoor situations. While the latter approach 
can reduce the effect of background changes it results in an additional computational 
overhead. Furthermore it requires precise timing of when to update the reference image. 
Otherwise objects present on or near the surface might be included in the reference 
image resulting in erroneous detection results. 

A reference image is usually computed using the arithmetical mean of a large number 
of camera frames. A static reference image Rstatic is defined as 

RstaUc{x,y)^^^^^^ (2.7) 

n 

with Ii, . . . ,In being the prerecorded background images. 

A dynamic reference frame FC'dynamic time i using a running average is defined as 

Rdynamicix,y) = ' ^x) * R]^-^^^-^{x , y) + a * P {x , y) ^2 8) 

^dynamici-'^ y y) ~ RstaUc{x,y) 

with P and a being the camera image at time i and the learning rate respectively The 
learning rate defines the speed in which the reference image adapts to changes in the 
background. 
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Image Filtering 

This step is essential to facilitate the following blob detection by removing noise from the 
image. Generally everything that is not related to an intentional contact with the surface 
is considered as noise. Using this definition different types of noise can be distinguished: 

Camera Noise Digital cameras inherently suffer from two types of image noise known 
as fixed pattern noise and random noise that are introduced during the image ac- 
quisition process. These result in slight variations of pixel intensities and mainly 
affect high frequencies in the image. Furthermore, especially in low-end cameras, 
compression artifacts are introduced into the output resulting in a certain "block- 
iness" of the image. 

Debris on the surface Considering a tabletop surface as its name suggests as a table, 
the user might place random objects required for the task onto the surface. As 
these might range from small objects like pens to larger ones like paper, these must 
not result in erroneously detected touch inputs. 

Palm/Wrist/Arm During user interaction surface contact of any body part other than 
fingertips is generally considered unintentional. 

Most multi-touch prototypes solely focus on reducing the effect of camera noise as the 
other two are not part of the expected usage scenario. Therefore it is up to the user to 
not violate these assumptions to ensure a proper functioning of the device. However long 
periods of interaction on large tabletops usually result in arm fatigue largely increasing 
the likelihood of unintentional touches. The effect of camera noise is reduced by applying 
a low-pass filter that eliminates high frequencies. 

Some image processing pipelines are designed to remove the latter two types of noise 
as well (see [66] )■ This is achieved by defining a range of expected input dimensions 
and filtering objects larger or smaller than this range from the input image. As multi- 
touch tabletops are generally used with fingers or a stylus as input modality this can be 
considered a valid assumption. However this approach is incompatible with the use of 
markers located on the bottom side of objects placed on the surface as these would be 
removed during filtering. 

The above mentioned low-pass filtering step in image processing is equal to blurring 
of the image. Simple blurring is achieved by applying a box filter that replaces a pixel's 
value with the average of the area around it. A more sophisticated form is Gaussian blur 
where pixel intensities are weighted according to the Gaussian. Blurring of an image I 
is usually computed using convolution (denoted as *) 

rfc/21 \i/2] 

lB{x,y) = I*K= Yl E H^ + hy + j)-K{[k/'2\+t,ll/2\+j) (2.9) 

i=-[k/2i j=-[l/2i 

with K being the convolution kernel of size k x / having its origin located in its center. 
The size of the kernel defined according to the box filter or the Gaussian determines the 
frequency cut-off resulting from the convolution. 
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An essential property of convolution kernels is separability. This property refers to the 
dimensionality of a kernel and allows the representation of a higher-dimensional kernel 
using a set of lower-dimensional kernels therefore highly reducing its computational 
complexity. Consider a two-dimensional kernel. If the kernel is separable, the result of 
its convolution with an image is equal to the result of sequential convolutions using two 
one- dimensional kernels. Therefore 

K = V*H (2.10) 

with V and H being one-dimensional kernels representing the vertical and horizontal 
projections respectively. Both the box-filter and the Gaussian filter are separable and 
can therefore be efficiently computed in image processing. 

In order to filter objects outside a size range usually a band-pass filter is used. A band- 
pass filter can be thought of as the difference of two low-pass filters while their cut-off 
frequencies define the minimum and maximum size of the target object respectively. A 
band-pass filter based on two box-filters or two Gaussian filters is called Mid-Box or Mid- 
Gauss respectively. In [66] a Dual Quad filter is proposed as an efficient approximation 
of the Mid-Gauss filter. In their evaluation of different filters within a multi-touch 
processing pipeline, they found the Mid- Gauss and Dual Quad to both achieve the 
highest accuracy while the Dual Quad significantly outperforms the Mid-Gauss. 

Thresholding 

Following the previous background removal and filtering steps, the guiding assumption 
of this step is that now the intensity of every pixel relates to the distance of the displayed 
object to the surface. Therefore in this step all those pixels belonging to objects that 
actually touch the surface are being identified using thresholding. 

Thresholding is a simple and widely used image classification operation that assigns 
each pixel based on its intensity to a class from a predefined set. In image processing 
binary thresholding is the most common thresholding type used to differentiate between 
foreground and background. Foreground usually refers to objects of interest while ev- 
erything else is considered as background. 

Thresholding of an image / is defined as 

T{x,y) = { V ,y; - ^2.11) 

I otherwise 

with T being the threshold. 



As illustrated by equation |2.11[ the result of this operation is solely dependent on T 
which is why the appropriate choice of T is crucial. One usually differentiates between 
fixed and adaptive thresholding techniques based on the threshold selection process. In 
the former case T is chosen manually based on experience or another heuristic that does 
not rely on knowledge of the actual image content. The latter however uses information 
extracted from the image content to infer an appropriate threshold value. One of the 
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most well known examples of this selection process is Otsu's method that classifies image 
pixels in two distinct classes such that the intra-class variance is minimized |49j . 

However the accuracy of the thresholding operation might suffer from noise and non- 
uniform illumination. The variation of pixel intensities introduced by noise results in 
misclassification of pixels with intensity close to the threshold value while non-uniform 
illumination prevents the accurate classification of all pixels considered as foreground. 
These shortcomings stem from the application of a single threshold value for all pix- 
els regardless of local image characteristics. Therefore one has to further differentiate 
between global and local thresholding methods. Usually globally operating methods 
are preferred thanks to their superior performance in real-time systems however local 
adaptive thresholding has been successfully employed in the reacTIVision system [5] . A 
vast number of different thresholding techniques exists that are outside the scope of this 
thesis. The interested reader is referred to [5Tj which provides a comprehensive survey 
of existing techniques. 

Returning to the initial assumption that pixel intensities directly relate to object 
distance two important limiting remarks need to be made. 

Uniform illumination is very hard to achieve in practice 

Although being fundamentally related to the above assumption in practice it proves 
to be very hard to actually achieve an uniform illumination of large surfaces. To 
circumvent this problem the number of infrared illuminators is usually kept low 
however even a single illuminator still spreads light unevenly off axis. 

Objects reflect light differently 

Considering the fingers of a hand, significant differences in the amount of light 
reflected upon touch can be observed. The amount of reflected light depends on 
the size of the contact area and the pressure applied by the flngertip hereby for 
instance favoring the index flnger compared to the little flnger. 

Therefore the choice of an appropriate threshold is a challenging task requiring sig- 
nificant manual calibration efforts. Furthermore on large-scale surfaces non-uniform 
illumination becomes a serious issue making the distinction between hover and actual 
touch of an object greatly more difficult. 

Connected Component Analysis 

Up to this step pixels have only been considered individually without considering the 
possible connectivity between them. As objects span more than a single pixel, a set 
of neighboring pixels can be grouped assuming that all those pixels represent the same 
object. Grouping is performed using a neighborhood relation, usually 4-neighborhood, 
and a homogeneity criterion that must be fulfilled by all pixels included in the set. 

Connected Component 

A region R in an image / represents a connected component C if and only if 

C = {p\peRA H{p, I{xp, yp)) = 1} (2.12) 
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with H being a homogeneity criterion function. H returns 1 if the criterion holds 
for a pixel or otherwise. 

As the previous step already performed a homogenization of image pixels by perform- 
ing binary classification of pixels as either foreground or background, the homogeneity 
criterion function is simply equivalent to the result of T{x, y). 

A naive approach to finding connected components in an image would consist of 
scanning image pixels row by row and running a region-growing algorithm once a pixel 
has been found for which H{p, I{xp, Up)) = 1. However due to the unrestricted nature of 
the region-growing algorithm a large number of unnecessary comparisons and boundary 
checks need to be performed. More sophisticated approaches exist such as the sequential 



algorithm presented by |Horn| which will be described in section [3.2.1 

The identified connected components are generally considered valid contact areas re- 
sulting from user interaction. However simple plausibility checks comparing their size 
with predefined minimum and maximum values are still performed in order to discard 
unreasonably small or large contacts. The remaining components Cj are then reduced 
to a single contact point, usually their center of mass q 

h- ^ '^'Rrr v] (2.13) 

2.2.3 Blob Tracking 

Complementary to blob detection that runs on a per frame basis tracking now aims to 
establish intra-frame relationships between the detected contact points. Given the frame 
rates of usually 30 and up to 60 frames per second a simple nearest neighbor matching 
between the sets of contact points of consecutive frames is generally sufficient. However 
in case of unsteady motion or multiple contact points in proximity the application of 
a motion model might be required. Such a model could then be used to predict the 
current position of a contact point hereby improving the performance of the subsequent 
nearest neighbor matching. This approach using a Kalman filter has been described by 



Oka et al. in |38] and is outlined in the following section. 



Hence blob tracking comprises the following two processing steps: 



Given the position of a blob in previous frames predict its estimated position in 
the current frame using a motion model, e.g. the Kalman filter. 

• Using the predicted positions of previous blobs in the current frame, establish 
relationships between blobs using nearest neighbor matching. 

Kalman Filter 

The Kalman filter allows reliable predictions of process states at arbitrary points in time 
both in the past and the future through minimization of the average quadratic error. 
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Furthermore a feedback loop is used that considers the difference between the predicted 
and measured state at the current point in time in future predictions. Basically the 
Kalman filter distinguishes a priori and a posteriori predictions. The former describe 
state predictions by one time step into the future based on the current state. The latter 
include both the a priori prediction and the actual measurement hereby taking into 
consideration the prediction error of the a priori prediction. 

According to [62] two groups of equations can be distinguished: time update and 
measurement update equations. The former serve the prediction of states while the 
latter take into account the actual measurements. Given these two groups of equations 
the following algorithmic design is described: 

The prediction starts with the calculation of the state at time k + 1: 

= A-Xk^i + B -Uk^i (2.14) 

with and Xk being the a priori and a posteriori states at time k respectively. A 
denominates the state transition matrix while Uk and B represent deterministic influences 
onto the system and its dynamics between consecutive time steps. 

Additionally to the state the error covariance is predicted using the following formula: 

= APk.iA'' + Q (2.15) 

with and Pk describing the a priori and a posteriori error covariance respectively. 
Q represents the process noise covariance. 

In order to calculate the a posteriori values within this model, the following measure- 
ment update equations are used. First the Kalman gain is calculated: 

Kk = Pk ■ H^iHP.H^ + R)-^ (2.16) 

H corresponds to the observation matrix and R to the measurement noise covariance. 
During the next step the a posteriori state is determined using the following equation: 

Xk = + Kk{zk - Hxl_^) (2.17) 

Zk represents the state measured at time k. 

Finally the a posteriori error covariance is calculated: 

Pk = {I-KuH)P^ (2.18) 

For a more detailed overview on the Kalman filter, the extended Kalman filter and 
the initial value problem the interested reader is referred to |62j. 

The Kalman filter is being applied in a wide range of application scenarios and can 
be easily implemented due to its recursive nature. However the choice of appropriate 
initial values and the computational complexity of its calculations represent important 
disadvantages especially in the case of real-time applications. 
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Nearest Neighbor Search 

Nearest neighbor search is used to estabhsh the relationship between contact points in 
successive frames. Be Pj = pi^i, . . . ,Pm,i and P^+i = . . . ,Pn,i+i the contact point 

sets at frame i and i + 1 respectively. As nearest neighbor matching can be represented 
as an optimization problem, the optimal solution corresponds to the set S G Pi x Pj+i 
that minimizes the following equation: 

e = ^ II Pi+i -Pi h (2.19) 
(Pi ,pi+i)e5 

This problem can be solved using a greedy approach that works as follows: 

1. Be f2 = Pj X Pj+i the set of all possible combinations of Pj and Pi+i- S = 0. 

2. Find {pi,Pi+i) G Q such that 

W{qi,qi+i) I II pi+i -pi \\2<\\ qi+i - Qi h 

3. Add {pi,Pi+i) to the set of solutions 

S = SU {puPi+i) 

4. Remove all combinations that include either pi or pj+i from VL: 

n = n\{{pi,q) I q e P^+l}\{(g,Pm) k e 

5. If ^ goto|2] 

However due to the polynomial complexitjj^ of the algorithm search can be costly for 
large n. 



^The complexity can be shown to be bound by ©(n^ • log n) assuming m = n without loss of generality. 
First the computation of the distance matrix of all possible combinations of points from Pi and P^+i 
requires calculations resulting in complexity of 0(n^). As this matrix is symmetric this number 
can be reduced to ^ which does not affect the computational analysis though. In the next step 
these distances are sorted which can be achieved in 0{n^ -logrt). Finally iterating all distances takes 
again ^ operations. Hence the upper bound 0{n^ ■ logn). 
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At first glance, one might be wondering as to why one would envisage to develop a novel 
processing pipeline for optical multi-touch tabletops. In the last years many new devices 
have been presented that all rely on the same simple steps to process user input acquired 
by the camera. So there doesn't seem to be much wrong with that approach. However as 
[Benko and Wigdor point out current multi-touch prototypes exhibit shortcomings that 
severely impact usability and hence result in increased user frustration [6j. They argue 
that for multi-touch systems every touch counts and hence user interaction is easily 
affected by accidental inputs (also see for more detail). In these circumstances users 
however are only confronted with the resulting unintended application behavior and 
are left guessing on how they caused that erroneous behavior. Since they will be asking 
themselves what they have done wrong this is obviously a frustrating user experience and 
reduces acceptance of the technology. Most commonly these accidental inputs originate 
from objects placed on the table or parts of the user's arm touching the tabletop during 
interaction. Furthermore on interactive tabletops users instinctively tend to lean forward 
and rest their arms on the table as if they were sitting or standing around a regular table 
hereby increasing the probability of such accidental touches [21] . 

The problem however is not the user performing in unexpected ways but the general 
assumption that every touch is equal. Given that assumption unintentional touches have 
a too high potential to interrupt user interaction while multi-user interaction is made 
quasi impossible. Furthermore with the continuous growth of interactive tabletops multi- 
user interaction will be the de facto standard and hence the aforementioned assumption 
can no longer be considered valid. Nonetheless some approaches exist to tackle that 
issue as will be described in section 3.2.4 but they all simply consist of a processing 
layer added on top of the processing pipeline. However we are highly convinced that a 
processing pipeline that integrates both detection of user input as well as hierarchical 
clustering of these will in the end be a much better and consistent solution to that 
problem. 

Therefore a multi-touch tabletop prototype based on diffused illumination will be 
presented here that acts as proof-of-concept of a novel processing pipeline. At the heart 
of the pipeline is an algorithm called Maximally Stable Extremal Regions (MSER) which 
has been widely used to extract distinguished regions in an image that exhibit certain 
unique properties. While these properties will be used to detect contact points the 
algorithm design furthermore allows the hierarchical structuring of these regions that 
then will be exploited to infer a clustering of contact points. The resulting clusters are 
subsequently enriched with additional properties as to whether these contacts originate 
from a left or a right hand and map detected contact points to the fingers of the hand. 

First the design of the tabletop prototype will be presented. Then the major part 
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of this chapter will be dedicated to the elaboration of the proposed processing pipeline 
while in the end details of the implementation will be outlined. 



3.1 Multi-Touch Tabletop Prototype 

The proposed prototype is based on a setup previously developed by the Virtual Reality 
Systems Group at the Bauhaus University Weimar. This setup consisted of a DLP 
projector for image display and a diffuse front-illumination approach for multi-touch 
sensing. However the latter approach had two significant shortcomings: 

• Due to the large display size several infrared illuminators would be required around 
the table in order to reduce shadow effects. 

• As the infrared illuminators are external to the tabletop they increase the setup 
complexity and thwart the desired compactness of the prototype. 

Therefore a diffuse back-illumination approach was chosen for the new prototype hence 
integrating both image display and multi-touch sensing into a single device. The setup 



design is illustrated in figures |3.1| and |3.2[ The different components of the prototype 
are detailed below: 



Display The employed projector is from a dismantled Samsung DLP HL-67A750 67" , 
a high-definition TV capable of displaying active stereo 3D images. Usually projector- 
based prototypes require a significant amount of space behind the display surface for 
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Camera 
Projector 



IR Illuminator ^ 

Figure 3.2: Illustration of the prototype setup. 



the projector to cover large surfaces. However thanks to the short throw-distance of this 
projector a very compact design could be achieved. 

Camera For image acquisition a Point Grey FireFly MV IEEE 1394a fitted with a 
fisheye lens is used. It is highly sensitive to the near-IR spectrum while achieving suffi- 
ciently fast frame rates (752x480 at 60fps and 320x240 at up to 122fps). As the camera 
is also sensitive to visible light it had to be fitted with a band-pass filter that blocks 
light outside the near-IR spectrum. Although many commercial filters are available a 
piece of developed photo film has been found to work best. 

Infrared Illumination As the mirror used in the optical path of the projector covers a 
significant amount of space just below the tabletop surface uniform infrared illumination 
subsequently proved to be challenging. Moreover the illuminators had not to be directly 
visible to the camera as overexposure to infrared light might negatively affect sensing of 
neighboring areas. Therefore possible locations for infrared illuminators were restricted 
to either beside the camera or below the large mirror. However the first idea had to be 
dropped due to the large form factor and the limited viewing angle of the infrared illu- 
minators at our disposition during setup design. Hence we opted for the latter approach 
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using an array of infrared illuminators. The employed illuminators (IR SW 77) each 
consist of 77 LEDs emitting infrared light of 850nm wavelength. In order to spread the 
light across the whole surface a curved mirror was added to the light path as illustrated 
in figure |3.2 



3.2 Processing Pipeline 



an 



Considering the commonly used processing pipeline as described in section 2.2.2 
efficient method for blob detection exists given the assumption of uniform illumination. 
This approach is optimized to detect click and drag gestures of finger-like objects on 
the surface hence supporting the point-and-click metaphor of graphic user interfaces 
known from desktop computers. However if one tries to go beyond this metaphor and 
use other properties for user interaction such as which contact points belong to the 
same hand one finds oneself easily restricted by the default approach. This information 
however is essential in multi-user scenarios in order to properly interpret the user input. 
Since the traditional approach does not provide any further information other than the 
contact point itself additional processing is required to establish that information which 
obviously increases the computational cost of the pipeline. 

Besides the restrictions on interaction metaphors further constraints are applied on the 
allowed hand posture during user interaction. The traditional approach usually requires 
the fingertips to be the nearest body parts to the surface in order to ensure proper 
functioning and avoid unintentional touch points. Obviously expert users familiar with 
the functioning automatically adopt that posture while novice users might encounter 
those pitfalls discouraging them in their further use. However keeping fingers in an 
upright angle during multi touch interaction leads to strain and fatigue in long term 
usage. In this case the user would normally want to rest the palm or the whole hand 
on the surface however hereby disturbing the touch detection and possibly triggering 
unintended behavior. 

Similarly to resting the hand on the surface any other object placed onto the table 
might cause the same unwanted behavior. This however fundamentally contradicts the 
metaphor of a table which would allow placing objects such as tools necessary during the 
work process onto the table. Ideally the processing pipeline would differentiate between 
objects that are in direct interaction with the interactive tabletop such as those with an 
optical marker or unrelated such as pen and paper. 

Additionally to the aforementioned shortcomings when it comes to user interaction, 
from a technical viewpoint the assumption of uniform illumination which is essential for 
the traditional processing pipeline can be said to be impossible to achieve. Especially 
if tabletop displays grow in size or if only limited space is available behind the surface 
illumination properties tend to vary significantly across the surface as described in [20] . 



The same holds true for the presented prototype as can be seen in figure |3.3[ 

This thesis proposes a new approach to contact point detection that not only identifies 
objects on the surface but additionally provides spatial and hierarchical information 
while being robust in the presence of non-uniform illumination. Instead of processing 
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Figure 3.3: Camera image visualized as intensity height map. The uneven illumination 
is reflected in the deformed ground plane as well as in the highly unequal 
light response of all the flngers currently touching the surface. 



the camera image with image filters and thresholding to detect objects in contact with 
the surface, this approach is based on the analysis of distinguished regions in the camera 



image. At the core of the processing pipeline is an algorithm introduced by Mat as 



et al. called Maximally Stable Extremal Regions that reveals the hierarchical structure 
of extremal regions in an image. An extremal regioiij^is a particular type of distinguished 
region and can be defined as a set of connected pixels that are all of higher intensity 
than pixels on the region's outer boundary. Given the fact that objects in diffuse front 
illumination setups appear brighter the closer they get to the surface the representation 
of image content in terms of extremal regions provides some compelling advantages: 

• An extremal region only defines a relative relationship between the contained pix- 
els and those surrounding the region. Since it is independent from absolute in- 
tensity values it is more robust in the presence of non-uniform illumination than 
approaches relying on a global threshold. 

• One or more extremal regions can again be included in a larger extremal region 
hereby organizing distinguished image regions in a tree structure. That structure is 
later used to reveal relationships between objects such as grouping contact points 



see page 



41 



for a mathematical definition. In their original paper the concept of extremal region also 
includes the inverted case that all pixels inside a region are of lower intensity than those on the 
region's boundary. However this case is not relevant here and has therefore been omitted for the 
sake of simplicity. 
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belonging to the same hand. 

Extremal regions can be considered a very reliable and stable object representation 
as they have been successfully used as object features in visual tracking in jl6j . 



The complete processing pipeline is illustrated in figure 3.4 and comprises the following 
steps. 

Distortion Correction Reduce the distortion inherent to the camera optics and trans- 
form the resulting image such that the multi-touch surface covers the whole image. 

Illumination Correction Reduce the effects of ambient illumination by normalizing in- 
tensity values across the image. 

Region of Interest Detection Detect candidate regions for further processing to reduce 
the computational cost of subsequent processing steps. 

Maximally Stable Extremal Regions Analyze the regions of interest for extremal re- 
gions and create the hierarchical structure. 

Fingertip Detection Analyze the hierarchical structure and a number of image features 
computed from the extremal regions to identify fingertips touching the surface. 

Hand Distinction Group all revealed fingertips from the same hand into clusters. 

Hand and Fingertip Registration If all fingers of a hand are simultaneously touching 
the surface, a hand and fingertip registration process is performed. The process 
distinguishes between left and right hands and maps the revealed fingertips to the 
five fingers of a hand. 

Tracking Establish intra-frame correspondences between hands and fingertips from con- 
secutive frames. 



3.2.1 Preprocessing 

The preprocessing steps transform the incoming camera image into a normalized form 
in order to reduce image variations that would negatively impact performance of the 
following processing steps. These steps take into account both spatial and intensity 
transformations of the camera image that stem from intrinsic parameters of the prototype 
such as image distortions due to the camera and extrinsic parameters such as ambient 
illumination. 
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Figure 3.4: Diagram of the processing pipeline. The arrows illustrate the use of concur- 
rency within the pipeline. 
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Distortion Correction 

In order to evaluate input on the multi-touch surface, a uniform and undistorted view of 
the surface is required. However the representation of the surface in the image acquired 
from the camera has been subject to a number of deteriorating effects: 

Barrel Distortion 

Since the employed camera uses a fisheye lens to capture the multi-touch surface 
from short range, it inevitably introduces barrel distortion. This type of distortion 
is characterized by a bulging of usually straight lines resulting in a magnification 
effect of objects towards the image center. 

Perspective Transformation 

As the camera views the surface from a non-orthogonal angle, the image represen- 
tation of the surface has undergone perspective transformation. Hence the surface 
appears to decrease in size with increasing distance. 

Moreover it is in the end solely the image part covering the surface that is of any 
interest to us. Hence an image transformation is to be found that rectifies that image 
part while correcting the aforementioned effects as well. 

The usual approach is to define a set of points in the image that cover the entire 
surface and are known to have formed a rectangular grid prior to being distorted. This 
approach has the advantage of ignoring the origin of the distortions and reliably works 
in our scenario. The grid points have been defined in our prototype by covering the 
surface with a printed checkerboard pattern where the corners of black and white squares 



represent the rectangular grid (see figure 3.5). Contrarily to what one might think, the 



next step is not to find a transformation that maps points from the distorted to the 
undistorted set, but to compute the inverse transformation that allows us to find for 
each pixel in the undistorted image its exact correspondence in the distorted image. 

Be Pm,n and Qm,n the positions of the points in the distorted and undistorted images 
respectively. Since we know that the distorted points used to form a rectangular grid, 
a two-dimensional uniform bicubic B-spline interpolation will be used that defines a 
mapping between the parametric coordinates u,v and their distorted position {x,y) as 
follows: 



2 2 



f2{u,v) = {x,y) = Bi+i{u)Bj+i{v)Pm+i,n+j (3.1) 

i=-l j=-l 

where {u, v) are the parametric coordinates inside the grid cell formed by the four 
points Qm,n,Qm+i,n,Qm+i,n+i,Qm,n+i- Cubic B-spliue interpolation defines four ad- 
ditional blending functions Bi that determine the weight for each point and are defined 
as follows: 

Boiu) = 1(1 -uf 

B2{u) = l{-3u^ + 3u^ + 3u + 1) 
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m 

(a) (b) 
Figure 3.5: Illustration of the set of grid points in the distorted and undistorted image. 

Since the resulting distorted coordinates most likely fall in between actual pixels in the 
camera image, bilinear filtering on the intensity value is employed to interpolate between 
neighboring pixels. Obviously in a real-time processing pipeline the above bicubic B- 
spline interpolation would not be computed again at every frame. Unless the internal 
setup changes, the mapping between distorted and undistorted remains constant and 
only needs to be computed once. 




lumination Correction 



As detailed in section [2. 2. 2| illumination correction is an important step in the processing 
pipeline in order to achieve accurate results. While usually only background subtraction 
is performed more care needs to be taken here due to the following specificities of this 
setup that influence the illumination of a point on the tabletop surface: 

Tilting Effects The tabletop surface is tilted with respect to the camera in this setup. 
Usually the camera is aligned such that camera axis and surface plane are at a 
right angle. This reduces perspective distortion effects however in order to preserve 
the compactness of the previously developed prototype this has not been changed. 
While perspective distortion can be easily corrected a more important effect is the 
intensity fall off resulting from the tilted surface. 

Light rays reflected from objects on and above the surface are transmitted to the 
camera passing through the acrylic surface. At both sides of the surface these 
rays are subject to reflection and refraction. According to Snell's Law the incident 
angle and the refracted angle of a light ray at an interface are proportionally 
related. Hence the more oblique the angle between camera and surface the more 
deviates the incident angle from the surface normal. However as the incident 
angle determines the amount of internal refiectior0 inside the surface panel one 



^Thc amount of internal reflection rises with increasing incident angle until a critical angle that depends 
on the adjacent materials of an interface. Beyond the critical angle all light is reflected inside a 
material resulting in total internal reflection. This phenomenon is exploited in FTIR multi touch 
setups as described on page [9[ 
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can conclude that camera angle and intensity fall off are directly related. 

Vignetting Effects Wide-angle cameras such as the one used in this setup tend to be 
affected by vignetting effects. These describe non-uniform transformations in the 
image formation process due to several mechanisms related to the camera system. 
An often observed phenomenon is the irradiance fall off in the image periphery 
which can be attributed to vignetting. Vignetting occurs when the beam of in- 
cident light is blocked by internal parts of the optical system which is especially 
relevant for light arriving at oblique angles at the lens. Vignetting is prominent 
for larger apertures and generally absent for smaller apertures [2|. Another aber- 
ration however less important than vignetting is cosine-fourth. Cosine-fourth law 
describes an off-axis intensity fall off due to the foreshortening of the lens with 
respect to a point in space [HZ] ■ Another phenomenon is pupil aberration resulting 
in a non-uniform intensity distribution due to non-linear refraction of incoming 



Considering these effects it is obvious that a simple shifting of intensity values on a 
per-pixel basis as in background subtraction is not sufficient. Assuming that intensity 
variations due to the aforementioned effects can be represented for each pixel as affine 
transformations the relationship between original (/'(x, y)) and captured intensity values 
{I{x,y)) can be described as follows: 



with a and m being the additive and multiplicative component respectively (a and m 
are also referred to as noise and intensity inhomogeneity effect respectively). 

As a trade-off between performance and accuracy a min-max normalization approach 
was chosen using precomputed minimum and maximum intensity images. Similarly 
to background subtraction the minimum intensity values are defined by a prerecorded 
background image. Reasonable maximum intensity values can be found by moving a 
sheet of paper across the surface and to recording the maximum intensity values for 
each pixel. Given a maximum intensity image Imax and a minimum intensity image Imin 
the normalized intensity value Inormaiized is computed by 



with b representing the image bit depth. The maximum intensity image depends solely 
on the internal configuration of the prototype, hence requiring no recalibration unless 
the setup changed. The minimum intensity image similarly to the static background sub- 



cantly. However the subsequent processing steps have been found to be robust enough to 
be unaffected by even significant illumination changes due to stray sun light. Illumina- 
tion changes during operation might in some circumstances result in I{x, y) < Imini^, u) 



rays [2]. 




(3.3) 




(3.4) 




^see page 19 
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or I{x,y) > Imax{x,y). Therefore the value Inormaiised{x,y) needs to be clamped to the 
range (0,2^( to ensure proper functioning in the following stages. The effect of this 



normalization steps is visualized in figure 3.6 



Region of Interest Detection 

Considering that users usually only interact with a limited area of the tabletop surface 
at a time it seems like an unnecessary overhead to process the complete camera image at 
every frame. Therefore a simple region of interest detection is performed that provides 
the following advantages: 

• The following fingertip detection algorithm is limited to the relevant image areas 
that might represent an object or a hand on or above the surface hence significantly 
reducing computational costs. 

• This processing step results in a set of independent regions of interest. Given 
that independence those regions can be processed in parallel therefore taking full 
advantage of the threading capabilities of modern CPUs. 

As the tabletop surface is illuminated from below objects above the surface appear 
brighter the closer they get to the surface. The previous normalization step transformed 
pixel intensities such that a pixel's intensity is zero if it does not represent an object or 
larger than zero the closer it gets to the surface. Hence a simple thresholdirifj^opeiaXion 
on the intensity values seems appropriate to reveal regions of interest in the image. 
Although this approach is sensitive to illumination changes erroneously interpreting these 
intensity variations as user interaction, this only reduces the performance advantage 
without affecting the actual detection accuracy. 

Thresholding of an image / is defined as 

T(.,y)J' (3.5) 
I otherwise 

with T being the threshold. 

So far areas of interest in the camera image have been established on a per pixel 
basis only. Therefore pixels are now being grouped in distinct clusters according to a 
neighborhood criterion (throughout the processing pipeline a 4-neighborhood will be 
used). Those clusters are usually referred to as connected components. Generally a 
connected component represents a region in a binary image for which all its pixels have 
an intensity of 1. However a broader definition of such a component will be used here: 

Connected Component 

A region R in an image / represents a connected component C if and only if 

C = {p\peRA H{p, J(xp, Vp)) = 1} (3.6) 



see page 21 for a detailed description of the thresholding operation 
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(b) 

Figure 3.6: Camera images before and after the normalization step visualized as intensity 
height maps. Fingertips are clearly visible as small spikes while palms and 
arms are larger in size and have a rather blunt appearance. 
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Figure 3.7: (a) Pixel mask used in the connected components algorithm. 

(b) Case where two previously disconnected components will be merged. 



with H being a homogeneity criterion function. H returns 1 if the criterion holds 
for a pixel or otherwise. 

Thresholding represents the simplest form of a homogeneity criterion that is defined 
as 

H{p, I{xp, Vp)) = T{xp, Vp) (3.7) 

Connected component algorithms seek to find all connected components Ci, . . . ,Cn 
in an image / such that these components are maximal, i.e. no more neighboring pixels 
can be added to this component: 

\JpencAH{pJ{xp,yp))=Q (3.8) 

The simplest of all connected component algorithms works by starting at an arbitrary 



pixel and to recursively scan the image (see section 2.2.2). However as the component 
grows randomly in all directions it hardly exploits any cache locality since images are 
generally stored sequentially. Therefore a sequential connected components algorithm 
will be used here as proposed by Horn in 



The algorithm scans the image row-by-row, i.e. left-to-right and top-to-bottom, and 
therefore accesses image pixels in the same order as they are aligned in memory. When 
a pixel is accessed it is evaluated according to the defined homogeneity criterion and 
a label representing the connected component is added to the pixel. If the criterion 
does not hold true the label is simply zero. Otherwise neighboring pixels need to be 
considered in order to decide on a label for the current pixel. During execution all pixels 
above and left from the current pixel are known to have been evaluated before. As it is 



known to which components these pixel belong, the 2x2 matrix as shown in figure 3.7a 
is sufficient to label a pixel. Depending on the previously labeled pixels, the following 
cases may arise: 

1. Neither B,C or D have been labeled 

Create a new label for A 

2. B or C has been labeled 

Label A according to the label of B or C 
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3. B and C have been labeled 

Two cases need to be considered: 

• If B and C have the same label then just copy this label. 

• If B and C have not the same label then the equivalence of these two labels 

as 



is recorded. The label of B or C may be copied to A. (see figure 3.7b 
illustration of this case) 



4. Only D has been labeled 

If using a 8-neighborhood simply copy the label from D otherwise go to case [T] 

After the first processing pass the whole image has been processed and each pixel has 
been assigned a label representing its connected component. However due to case [3] there 
might exist a number of pixels that all belong to the same component however have been 
assigned different labels. Therefore in the original algorithm a second processing pass is 
being performed to achieve a consistent pixel labeling. During this pass all equivalent 
labels would be replaced with a new unique label. 

The following processing steps are also based on a 4-neighborhood criterion and require 
the image labeling as an accessibility mask only. Since detected regions are by definition 
not connected, a consistent labeling is not required and the second processing pass can 
be omitted here. However a number of different properties is computed for each region 
during the first processing pass: 



Pixel Count The pixel count will be used to discard regions that are too small and are 
considered to originate from noise. 



Intensity Histogram The intensity histogram will be used to efficiently organize data 
structures used during the execution of Maximally Stable Extremal Regions. 



Brightest Pixel The brightest pixel of the region will be used as a starting point for 
the subsequent Maximally Stable Extremal Regions algorithm. Choosing that 
pixel as a starting point slightly increases the performance of the first steps of the 
algorithm. 



In order to keep track of these features the label assigned to a pixel uniquely identi- 
fies a data structure containing this information. As regions, as in figure 3.7b , might 
merge during the algorithm the data structure additionally carries a reference to its root 
structure into which the new information will be merged. The root data structure is by 
definition the one that started a region, hence has been created earliest (the red region 
in the case of figure 3.7b). Consider that the blue and red regions from the figure have 
been merged. If subsequently another region is to be merged into the blue one, it is 
going to be merged into the root structure instead (belonging to the red region), as the 
reference has been set during the previous merging. 
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3.2.2 Maximally Stable Extremal Regions 

Maximally Stable Extremal Regions (MSER) is a widely used blob detection algorithm 



first described by Matas et al. in [H]. It robustly detects distinguished regions, i.e. re- 
gions that can be detected in image sequences or multiple views of the same scene with 
high repeatability. They then used these regions in order to establish correspondences 
in stereo image pairs. In the proposed algorithm they introduced a new type of distin- 
guished region called maximally stable extremal region. This concept is an extension of 
extremal regions which are defined as follows: 

Extremal Region 

Given a region R in an image /, this region is called extremal if and only if all 
pixels within this region are either of higher or lower intensity than the pixels on 
the region's outer boundary: 

(Vp eR,WqenR\ I{xp,yp) < I{xg,yg)) V (Vp G i?, Vg G Q^. I H^p^Vp) > li^q^Vq)) 

The concept of maximally stable extremal region additionally includes a stability 
criterion based on a region's relative growth. 

Maximally Stable Extremal Region 

Be regions in an image / such that Ri-i G Ri G Ri+i- Region Ri 

is a maximally stable region if and only if Ri_i, Ri, Ri^i are extremal regions and 
the following stability property attains a local minimum in i: 

s{z) 

In order to adjust the required level of contrast the criterion is usually extended 
to contain a user-defined constant A: 

sa(^) 



Ri+l 




Ri-i 




Ri 





Ri+A 




Ri-A 




Ri 





Matas et al. describe the detection process of these regions informally as follows. Be / 
a gray-level image thresholded on all possible intensity values i G {0, . . . , 255} resulting 
in 256 binary images Bi. Bq contains only black pixels while in i?255 all pixels are white. 
Hence, when iterating through Bi with ascending i the initial black image gradually 
turns into a complete white image with new white blobs appearing and existing blobs 
increasing in size. Considering the connected components from all thresholded images, 
each of them corresponds to an extremal region. Furthermore each extremal region R^ 
at intensity threshold i is contained by exactly one extremal region from each Bj with 
j > i, since connected components do not decrease in size with increasing intensity 
threshold. 



The algorithm can be illustrated as shown in figure 3.8 and exhibits a certain resem- 



blance to a watershed segmentation]^ Although the algorithmic design is similar the way 



^See (53j for a complete description of the watershed transform 
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(c) 

Figure 3.8: Visualization of the Maximally Stable Extremal Region algorithm as a rising 
waterfront. Left: The image represented as intensity height map where 
the waterfront indicates the current threshold. Right: The outline of the 
thresholded image at the water level. Outlines from previous thresholds 

42 have been overlaid. 
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regions are selected differs significantly. The set of regions returned by the watershed 
algorithm forms a partition of the whole image whereas regions revealed by MSER might 
overlap and usually only cover a limited areas of the image. However they mainly differ 
in when a region is considered interesting. Watershed segmentation focuses on threshold 
levels when previously distinct regions merge. These are highly unstable and therefore 
inapt as distinguished regions. In contrast MSER chooses regions at threshold levels 
where the size of the region remains unchanged within a defined intensity range. 

As described above an extremal region at threshold level i is contained by exactly 
one extremal region at each higher intensity level. However not all extremal regions 
are maximally stable, therefore it follows that a maximally stable extremal region at 
threshold level i is included in at most one maximally stable extremal region from each 
Bj with j > i. Hence, maximally stable extremal regions can be linked according to 
a parent - child relationship. Furthermore as regions merge with increasing threshold 
intensity a parent can have more than one child, however a child region has at most one 
direct parent region. Thus maximally stable extremal regions can be associated in a hier- 



archical structure called component tree. Such a component tree is displayed in figure 3.9 



According to Matas et al. maximally stable extremal regions are characterized by the 



following properties: 

1. Illumination invariant as they are invariant to affine transformation of image in- 
tensities 

2. Viewpoint independent as they are covariant to adjacency preserving transforma- 
tions on the image domain 



43 



3 Multi-Touch System 



3. Stability due to the stability criterion 

4. Multi-scale detection 
Algorithm 

The algorithm as described in [H] proceeds as follows: 

1. The algorithm can identify either extremal regions of lower intensity than its 
boundary pixels or those that are of higher intensity, however not both at the 
same time. Hence: 

a) In the former case continue the execution of the algorithm normally. 

b) In the latter case invert the source image hereby converting between the 
two types of extremal regions. This is the relevant case in this processing 
pipeline. 

2. Pixels are sorted by ascending intensity. As intensity values are bound to 
the range from to 255, they propose BinSort for this step which has a 
computational complexity of 0{n) with n being the number of pixels. 

3. Iterate through all intensity values, i.e. from to 255 and add the pixels 
corresponding to the current intensity value to the image. 

4. Keep track of newly appearing, growing and merging connected components 
within the image. This is done using a union- find data structure [59] with 
quasi-linear computational complexitjj^of 0{na{n)) with n being the number 
of pixels. 

5. Growth rate of connected components is evaluated according to the stabil- 
ity criterion and if locally minimal the connected component is selected as 
maximally stable extremal region. 

Hence the algorithm runs in quasi-linear time with respect to the number of pixels. 



However in |2008||Nister and Stewenius] proposed a different algorithm that finds max- 
imally stable extremal regions in true linear time |17]. Unlike the above algorithm it 
does not resemble a rising waterfront continuously filling holes in the intensity height 



image but rather a flood- fill spreading across the image (see figure 3.10). Since only 
pixels on the boundary of the "flooding" are considered the algorithm does not rely on 
the union-find data structure as a simple priority queue is sufficient. 

Informally speaking the algorithm can be outlined as follows: Considering an image as 
a height map based on its intensities. Water is poured on an arbitrarily chosen point from 
which water starts flowing downhill. Once the lowest point of a sink is reached water 
starts filling up. The filling of a sink would correspond to the growing of a connected 
component. Whenever the water surface of the sink did not change notably in size after 
filling a certain amount of water this connected component can be considered stable 
according to the above definition. Once a sink is fully filled water starts pouring into 
adjacent sinks. The algorithm is finished after the whole image has been fiooded. 



^The complexity is reached when using union-find with compression and Hnking by size. a{n) repre- 
sents the inverse of the Ackermann function which is small for all practical n 



44 



3.2 Processing Pipeline 




(c) (d) 

Figure 3.10: Illustration of the linear time algorithm, (a) Water is poured at an arbitrary 
point and runs downhill until reaching a sink, (b) The sink is being filled 
with water, (c) Water flows over to a neighboring sink, (d) The neighboring 
sink is being filled with water. 

Linear-time algorithm 

The algorithm relies on the following data structures that describe the current 
state of execution: 

• A binary mask in order to mark the pixels that have already been visited 

• A priority queue of pixels on the outer boundary of the currently flooded 
image region. The queue is prioritized based on the pixels intensity values 
with lower intensities coming first. 

• A stack of components. A component corresponds to a sink that is currently 
being filled with water. Therefore a component is not always necessarily an 
extremal region. The topmost component on the stack corresponds to the 
currently considered component. Whenever a sink is filled and adjacent sinks 
exist, a new component is pushed onto the stack and the neighboring sink is 
explored. Therefore the maximum stack size is equal to the number of gray 



45 



3 Multi-Touch System 



levels in the image, i.e. at most 256. 
The algorithm proceeds as follows |17] . 

1. Choose an arbitrary starting point, make it the current pixel and mark it as 
visited. 

2. Initialize a new component with the current pixel's intensity and push it onto 
the stack. 

3. Examine all neighboring pixel of the current pixel and mark them as visited. 

• If the intensity is higher than the current pixel's intensity the neighboring 
pixel is added to the priority queue of boundary pixels. 

• If the intensity is lower the current pixel is added to the priority queue 
and the neighboring pixel is made the current pixel. Continue with [2j 

4. All neighboring pixel of the current pixel have either been visited or have 
higher intensity. Therefore accumulate the component with the current pixel. 

5. Get the next pixel from the priority queue. 

• If its intensity is equal to the current pixel's intensity, continue with [3] 

• If its intensity is higher than the current pixel's intensity, all components 
on the stack are merged until the intensity of the topmost component is 
greater or equal to the next pixel's intensity. Continue with[3j 

• If the queue is empty, the algorithm terminates. 

Modifications of the algorithm In order to accelerate the subsequent processing steps, 
the algorithm has been modified to gather additional information on extremal regions 
and their spatial structure during its execution. 

Extension to reveal all extremal regions 

In the default design the algorithm only reveals extremal regions that are maxi- 
mal according to the stability criterion. Hence a number of intermediate extremal 
regions are being discarded and do not appear in the component tree. However 
these additional regions might convey important information as to the spatial rela- 
tionship between regions that will be exploited extensively in the hand distinction 
processing step. Since this information is considered to be crucial, the algorithm 
design has been changed to identify all available extremal regions. Nonetheless 
the stability measure is still being computed for each region and might be used in 
further processing steps. 

Characterization of extremal regions using local descriptors 

In order to being able to use the identified regions for fingertip recognition a set of 
local descriptors will be used. Due to the incremental growth of components the 
following features can be efficiently computed during execution of the algorithm. 
Be R the region defined by the component in an image /: 
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Intensity statistics 

The mean and variance of the intensity distribution characterize the pixel's 
intensity values within a component. Generally, the mean fi and variance 
0"^ (the variance is represented as the squared standard deviation here) for a 
discrete signal are defined as follows: 



1 
1 

1 

"X 

1 



J2H^p,yp) (3.9) 



peR 



^{I{xp,yp) - fif (3.10) 



peR 



'^Hxp,ypY - l——^I{xp,yp)\ (3.11) 



peR 




5^/(xp,i/p)2-/i2 (3.12) 

peR 



The mean is simply represented by the arithmetic mean of all pixel intensities 
within a region while the variance can be described as the mean of squares 
minus the square of the mean. 

During the gradual growth of components through accumulation of new pixels 
and merging of existing components, these statistics can be be efficiently 
updated by keeping track of the following two equations: 

5*1 = ^I{xp,yp) (3.13) 
peR 

S2 = Y.I{xp,ypf (3.14) 
peR 



Whenever a pixel is added to the region the intensity and its square are added 



to the two equations. On merging the sums S^"^ and S\ 
from two regions i?'-^^ and i?'-^-' respectively are simply added one to another. 



:(2) 



and and 5*2 



(2) 



Based on equations 3.13 and 3.14 the mean /i and variance are calculated 
as follows: 



R 
R 



(3.15) 
(3.16) 



Image moments 

Image moments have been widely used in image processing algorithms for 
pattern recognition or object classification. A set of image moments can ac- 
curately describe an objects shape as well as a range of different geometric 
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properties [16]. Furthermore image moments can be used to form a set of 
shape descriptors that are invariant with respect to a set of image trans- 



formations. |Hu| was the first to use image moments in pattern recognition. 
In [30] he derived a still widely used set of descriptors that are invariant to 
translation, rotation and scale. 

Generally one distinguishes two types of moments in image processing, geo- 
metric and photometric moments. Geometric moments only take into account 
the shape information of a region while photometric moments include each 
pixel's intensity value as well. 

The basic definition of image moments is as follows [19j: 

Given a real function f{x, y) with finite non-zero integral, the two-dimensional 

moment of order (p + q) with p, g e N is defined as 

^pq = j y"a;V/(a;,l/)dxdy (3.17) 

Considering the discrete case and a region R for which the image moment is 
to be calculated, the above equation then becomes 



TTlpq 



J2x'y^fix,y) (3.18) 



The difference between geometric and photometric moments is in the defi- 
nition of the function /. In the case of geometric moments f{x,y) would 
be 

1 if (x, y) e R 
otherwise 



f{.x,y) 

while in the case of photometric moments f would be defined as 

f{x,y) = 



I{x,y) if(x,?/)ei? 
otherwise 

The following part will focus on geometric moments therefore / is considered 
to be defined accordingly. 

From mpq a certain number of characteristics may be derived, moo repre- 
sents the number of pixels within the considered region, i.e. its area, while 
Xc = mio/moo and yc = moi/moo define its center of mass (xc, yc)- The center 
of mass is the point where the whole mass of a region might be concentrated 
without changing its first moments [28]. As nipq is invariant neither to trans- 
lation, rotation nor scale, it will be called a raw moment. In contrast central 
moments are invariant to translation and defined as 

/ip, = Y.(x- x,ny - y,yf{x, y) (3.19) 



48 



3.2 Processing Pipeline 



Central moments have the following characteristics: 

A^OO = ''^00; f^lO = 1^01 = 0, 

/i2o and /io2 represent the variance about the center of mass, while the covari- 
ance measure is defined by /in [IB]. 

Besides using the previously described equation, central moments can also be 
calculated directly from raw moments: 

'"^'^ = E E (l) f ^) i-^cY'-'\-ycY'-'^^k, (3.20) 



Central moments can be made scale invariant through normalization, there- 
fore normalized moments can be derived as follows: 



PI ,(p+q+2)/2 



(3.21) 



Having achieved translation and scale invariance so far, it was Hu who first 
published shape descriptors that were also rotation invariant. The 7 shape 
descriptors as proposed by |Hu| are as follows: 



+ 



+ 



^20 + ^02 

{j^20 + ^02)"^+ 4:Uf^ 

(z/30 - 3z/i2)^ + (3Z/21 - Z/os)^ 

(«^30 + 1^12)^ + {1^21 + l^Ozf 

(z/30 - 3z/i2)(z/30 + 1^12) [(^^30 + 1^12^ " K^21 + ^^03)^] 

(3z/2i - z/03)(z/2i + I'm) [3(i^30 + i^i2f - {1^21 + ^^03)^] 

(^^20 + 1^02) [(^^30 + 1^12^ - 3(Z/21 + ^^03)^] 
4z^ll(z^30 + 1^12) (^^21 + 

(3z/2i - Z/03)('^30 + J^u) [(^^30 + ^^12)^ - 3(z/21 + I^os)^] 

(z/30 - 3z/i2)(z^2i + z^os) [3(z^30 + i^uY - {1^21 + ^^03)^] 



Bounding volumes 

Image moment based shape descriptors provide a performant but unintuitive 
way of describing a region's shape, however it is sometimes useful to find a 
simpler representation based on geometric primitives. The simplest form of 
such a representation is an axis-aligned rectangle enclosing all pixels belonging 
to a region. Though being a very rough approximation it is well suited to 
locate a region in an image with respect to other regions. Such a rectangle 
with minimal area is called hounding box. 

A severe drawback of an axis-aligned bounding box is its insensitivity to ob- 
ject orientation. The area of a bounding box might increase through object 
rotation, although the actual area of the object did not change. Therefore 
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another shape representation using an oriented elhpse is presented. Different 
approaches to compute such an elhpse exist, however optimal techniques are 
computationally expensive. Nonetheless a non-optimal approximation can be 
easily derived using image moments. Given a region R with geometric mo- 
ments moQ,mio,moi, mil, 17120,^02 its bounding ellipse can be computed as 
follows [ill52]. 

Since the resulting bounding ellipse conserves the source energy, i.e. has 
the same zeroth, first and second order moments, the centroid (xc, yc) of the 
enclosing ellipse is equal to the center of mass of R: 

miQ moi 

Xc = and yc = (3.22) 

moo "^00 

The orientation 6, semi-minor axis h and semi-major axis w can be computed 
by 

9 = - ■ tan-i ' 



2 V a — c 



h = J2- (^a + c- ^yb^ + {a-cy 



2 ■ (a + c + ^/WT 



w = \I2- \a + c+ \/0^ + (a- c)^ 



with 



/^20 ^20 2 

— = 

moo "^00 



2. ifn 



/ mil \ 
■ Xcyc 



moo 

f^02 mo2 2 

c = = yi 

moo moo 

with /i20;/^02 and /in being the second order central moments of R. 



3.2.3 Fingertip Detection 

The component tree resulting from the MSER processing is at the basis of this and 
the following processing steps. This tree structure has two important properties with 
respect to its contained extremal regions: 

• The darkest and at the same time largest extremal region forms the root of the 
component tree. 

• For all child-parent relationships holds the property that the child is smaller in 
size and has brighter intensity than its parent. 

As fingertips in contact with the surface create the brightest spots in an image it follows 
that only leaf nodes of the component tree can be considered fingertip candidates. 



These fingertip candidates are evaluated as follows: 
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1. Identify the extremal regions that represent the finger of a fingertip candidate. 
This apphes to all parent regions of the candidate up to a maximum size and as 
long these do not have siblings, i.e. the candidate region is the only leaf contained 
in the parent's subtree. The largest of these regions will subsequently be referred 
to as finger. 

2. For each candidate region a feature vector is compiled containing the following 
properties: 

• Bounding ellipse of candidate region (major and minor axis). 

• Bounding ellipse of finger region (major and minor axis). 

• The depth of the subtree having the finger region as its root node. 

• The maximum relative growth in pixel size of any child-parent relationship 
between the candidate and the finger region. 

• The relative growth in pixel size of the finger region with respect to the 
candidate region. 

• The ratio of the intensity ranges of the candidate and finger regions respec- 
tively. 

• The maximum value of the first of the seven shape descriptors introduced by 



Hu for all regions below the finger region. 



• The number of pixels within a defined distance that have an intensity of less 
then 10% of the candidate region's mean intensity. This feature is computed 
during this step for candidate regions only and is based on the observation 



that fingertips appear as spikes in the intensity height image (see figure 3.6) 
hence resulting in a significant intensity fall-off in vicinity. 



3. Calculate a confidence score C using the weighted sum of the feature vector: 

C = J2wi-fi (3.23) 

fi 

The weights have been chosen such that dimension differences between features 
are compensated without giving any of the features a too high influence on the 
flnal score. 



4. Two thresholds have been deflned on the confidence score to classify candidate 
regions either as no confidence, low confidence or high confidence. However having 
a high confidence score in a single frame is not sufficient for a region to be regarded 
as a fingertip. A candidate region must achieve in three consecutive frames at least 
once a high confidence and never a no confidence score in order to be classified as 
an actual fingertip. 
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3.2.4 Hand Distinction 

In order to establish the finger-hand relationships of identified contact points several 
approaches have been proposed in the literature. For instance [10] and [61] use the ori- 
entation of the contact area's bounding ellipse to infer the relationship between hands 
and fingers. That feature however is highly dependent on the finger posture. While an 
oblique touch yields a bounding ellipse with high eccentricity, hence the orientation can 
be inferred with high confidence, a vertical touch in contrast results in a more circular 
shaped ellipse which does not provide useful orientation information [61]. The assump- 
tion that the user always touches the surface at an oblique angle however requires prior 
knowledge by the user and might prove confusing for novice users. Another approach 



proposed by |Dohse et al.| is to use an overhead camera to group fingertips. Although 
that approach works reasonably well it requires an external camera hence thwarting the 
compactness of the setup while adding a significant additional processing overhead to 
the pipeline [T5] . 

In this approach however the grouping can be robustly established using solely the 
information from the previous processing steps, that is the component tree and the 
identified contact points that to a given degree of confidence correspond to fingertips. 
These fingertips might or might not belong to the same hand or even the same user. 
As the respective extremal regions are leafs in the component tree they are contained 
in a number of regions of higher level until all of them are contained in the root region. 
Hence there already exists a spatial clustering of these regions based on intensity values. 
Therefore fingertips could simply be grouped gradually while ascending in the component 
tree until a homogeneity criterion such as the maximum distance between fingertips 
or the maximum size of the parent region is violated. While that approach generally 
works reasonably well problems arise under certain conditions. In the case of excessive 
ambient light such as from stray sun light the amount of contrast in the camera image is 
heavily reduced. However with reduced contrast less extremal regions are being revealed 
resulting in a sparse component tree. That loss of spatial information might lead to an 
erroneous clustering or a dead-lock situation when too many fingertips are child of the 
same region. 

Therefore a more robust approach will be presented here that combines the advantages 
of the idea described above with agglomerative hierarchical clustering. Hierarchical 
clustering itself would be unsuited in our scenario because of its high computational 
complexit}]^ Tabletop surfaces continually grow in size hence enabling the simultaneous 
interaction of multiple users. However due the large number of resulting touch points 
unaided nearest neighbor clustering would be highly inefficient for real-time application. 

The idea is to only cluster a small subset of touch points at a time using agglomerative 
hierarchical clustering hereby reducing the impact of the polynomial complexity. The 
spatial clustering provided by the component tree, although erroneous at times, can 
serve as a reasonable input. The outline of the algorithm is as follows: 



^Generally its computational complexity is bound by 0{n^), however in certain cases a bound of 0{'n?) 
can be achieved[TTl [SH] 
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1. Traverse the component tree in postfix order. This ordering ensures that all chil- 
dren have been processed before the node itself is being processed. For each node 
do: 

• If the node is a leaf node and has been classified with at least low confidence 
as fingertip, create a new cluster containing this node. 

• Otherwise create the current set of clusters from all child nodes and process 
these as follows: 

a) Compute the distance matrix between all pairs of clusters of this node 
using a distance function d{Ci, C2). 

b) Find the pair of clusters Ci, C2 with lowest distance dc{Ci, C2) that still 
satisfies the distance criterion ViCi, C2, dc). 

— If such a pair of clusters exists, merge the two and continue with la 

— Otherwise the clustering is finished. 

Different distance functions such as single-link (nearest neighbor) or complete-link 
(furthest neighbor) exist. Single link is defined by 

dc{Ci,C2)= min d{u,v) (3.24) 



while complete-link uses 



dc{Ci,C2) = max d{u,v) (3.25) 

u£Ci,v£C2 



with d being a distance measure. Here Euclidean distance is used as distance measure. 

Single link would not be appropriate as a cluster could in certain cases grow too large 
hereby including fingertips from a different hand in proximity. Complete link however 
considers the distance to the fingertip that is furthest away. Hence combining complete 
link with an appropriate distance criterion T) that places a constraint on cluster growth 
has been found to work best to cluster fingertips into hands. 

V is set to the maximum expected distance between any two fingertips of the same 
hand. The maximum distance is usually the one between the thumb and the little finger. 
However as user interaction using only these two fingers is rather rare the criterion has 
been restricted further based on the combined number of fingertips in the clusters. The 
lower the number of fingertips contained in both clusters the smaller is the maximum 
allowed distance. Hence fingertips that are at an unusually large distance from each 
other will only be merged once fingertips in intermediate positions have been found. 
Therefore this extension makes the criterion more robust as it alleviates negative effects 
of the chosen maximum hand distance being too large or too small. 

3.2.5 Hand and Fingertip Registration 

The registration of hands and fingertips in optical tabletop displays has recently received 
increased interest in the literature although the amount of research still remains limited. 
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Walther- Franks et al.| use a decision tree to classify fingertip configurations and are 
currently the only ones to provide fingertip registration with less than five fingers being 
present. However their approach can only be considered a starting point for further 
research as the accuracy of around 80% is still too low for productive use. One reason 
for the lack of accuracy is the anatomic variation of hands between humans. Although 
there are constraints on finger orientation and position, configurations differ highly based 
on hand and finger size. Therefore current approaches require the presence of five fingers 
in order to perform fingertip registration. To meet that requirement one could imagine 
the presence of five fingers being the trigger to display a menu aligned around the hand 
as proposed in [3j and [13]. The hand registration would for instance allow the display 
of different menus for the dominant and non-dominant hand. 

Similarly to the above approaches the registration process comprises the following 



steps which are also visualized in figure 3.11 



1. Order the five fingertips along the hand such that either the thumb or the little 
finger comes first. 

2. Identify the thumb from the set of fingertips. The fingertip registration of the 
remaining fingertips follows from the previous ordering. 

3. Infer from the fingertip registration whether the considered fingertips are from a 
left or right hand. 

The first step is crucial as an erroneous ordering can not be corrected at later stages of 
the registration process. Au and Tai for instance propose an angular ordering around the 



centroid of the five fingertips pj. However the centroid usually deviates too much from 
the perceived center position as the index, middle, ring and little finger are generally 
located very closely to each other. While this approach is robust if fingertips are aligned 
as an arc on the surface, considering one of the fingers being in movement such as the 
index finger performing a sliding gesture downwards the ordering might swap thumb 
and index finger. Hence another ordering will be proposed here based on the shortest 
distance between fingertips. This ordering is similar to the one described in [IS] which 
however was not known to the author at the time of elaboration as the mentioned paper 
was published only after the completion of the development of this prototype. The 
ordering is computed as follows: 

1. Be J-" the set of all fingertips /j. The set V is defined as the distances between all 
possible pairs of fingertips: 

{Wi, /,),/.,/,} I {fiJj)eFxF} 
with d{fi, fj) denoting the Euclidean distance between /j and fj. 

2. For each fingertip fi assign a unique label L{fi). 

3. Iterate V in increasing order of d{fi,fj). Be C the set of all edges forming the 
fingertip contour. For each pair (/j, fj) do 
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(a) Initial fingertip positions of a liand in 
resting position. 



(b) Fingertip ordering along shortest 
path. 





(c) Endpoints are known to be either 
thumb or little finger. 



(d) Identify thumb as the finger furthest 
away from the centroid of all five fin- 
gertips 




(e) Fingertips can now all be identified 
given the thumb (T) and the order- 
ing of fingertips. 
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Figure 3.11: Steps of the fingertip registration process. The hand shape and fingertip 
positions used in the drawing have been copied from an actual camera 
screenshot from the prototype. 
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. If L(/,) ^ L{f,) 



a) Add toC. 

b) Assign a common unique label to all fingertips fk that either fulfill L[fk) = 
L{fi) or L(/fc) = 

• If I C 1= 4: 

a) All fingertips have been included in the contour, (/j, fj) denotes the pair 
of fingertips on the contour that are furthest apart from each other, /j 
and fj define the endpoints of the contour which by definition of the 
ordering correspond either to the thumb or the little finger. 

Obviously the above ordering relies on the assumption that on the contour thumb and 
little finger are the two fingertips that are furthest apart from each other. Since a rather 
unnatural hand pose would be required to violate that property it was deemed a valid 
assumption for regular use. 

Therefore the next step in the registration process is to uniquely identify the thumb. 
As the previous ordering already reduced the number of candidate fingertips, the task is 
equivalent to distinguish thumb and little finger. Since the index, middle, ring and little 
finger influence the position of the centroid as mentioned above, thumb and little finger 
can be distinguished with respect to their distance to the centroid. The little finger 
is the one located closer to the centroid while the thumb is the one positioned further 
away. 

Finally the classification of the set of fingertips as belonging to the left or right hand 



remains to be done (see figure pn2| . Be /r, //, /m, fn and the thumb, index, middle. 



rin^ and little finger respectively. Based on the angle between the the vectors /t/l and 

frfi, left and right hand can be easily distinguished. If that angle is smaller than 180° 

the set of fingertips is classified as right hand, otherwise as left hand. In order to make 

the classification more robust in presence of finger movement such as the aforementioned 

y — y y y 

sliding down gestu re of the index finger, the vector frfiMR = frfi + ItIm + ItIr will 

be used instead of fxfi only. 

Hence the classification is defined as follows: 



I /t/l X fxfiMR I = I /t/l I • I frfiMR \ ■ sm 9 < ^ Left Hand , . 

^ ^ ^ ^ (o.ib) 

I /t/l X frfiMR I = I /t/l I ■ I hfiMR I • sm 6* > ^ Right Hand 

where x denotes the cross product of two vectors. 



3.2.6 Tracking 

The tracking step aims to establish intra-frame correspondences of the detected finger- 
tips and similarly to the default processing pipeline a two step approach is chosen here. 
First for all previously detected fingertips the estimated position is predicted in the cur- 
rent frame in order to simplify matching. However instead of the Kalman filter which 
is computationally demanding the much faster and equally accurate double exponential 
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(a) 



(b) 



Figure 3.12: Hand classification using the angle between little finger and the remaining 
fingertips with respect to the thumb. 



smoothing algorithm will be used as a motion model. Second using the predicted posi- 
tions of previous blobs in the current frame, relationships are established using nearest 
neighbor matching. However given the previous clustering of fingertips this step can 
be significantly accelerated. Instead of considering all possible combinations of finger- 
tips between the two frames, this number can be narrowed down by only considering 
combinations of fingertips whose containing cluster intersect. In order to test for an 
intersection the bounding boxes of the two clusters have been used. 

In the following section the double exponential smoothing algorithm will be outlined. 
However as the following nearest neighbor matching is basically the same as in the 



default processing pipehne, the reader is referred to page 23 for an in-depth description 



Double Exponential Smoothing 



The double exponential smoothing algorithm as proposed by LaViola 



in is based on 

the assumption that movements can be accurately approximated using linear equations. 
Obviously this only holds true for short prediction intervals. However as multi-touch 
applications require a high update frequency for the sake of interactivity, this does not 
appear to be an actual constraint. Given this assumption a simple yet performant 
algorithm design can be derived. 

For the approximation of the y-intercept and the slope of the linear movement equation 
two smoothing statistics are defined. These are calculated based on an exponential 
weighting of movement positions giving recent positions a higher weight compared to 
older ones. The degree of exponential decay is determined by a parameter a G [0, 1]. 
Be Pt the position vector at time t, therefore the smoothing statistics S^^ and S^^ are 
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calculated as follows: 



Spt = apt + (1 - a)S. 

sS = «4 + (1 - 



Pt-1 



[2] 



Pt-1 



(3.27) 
(3.28) 



Equation |3.27| calculates an exponentially weighted position value based on previous 

Based 



positions. This calculation is then again smoothed exponentially in equation 3.28 



on those two values an approximation of the linear movement equation can be derived 
allowing a position prediction at time t + X. For a detailed algebraic calculation the 
interested reader is referred to |39l. It follows: 



Pt+x 



a 



1 — a 



A 



1 



a 



1 



a 



A -5 



■[2] 
Pt 



(3.29) 



In order to achieve a correct prediction the algorithm assumes that movement positions 
are captured at equal intervals. Hence the algorithm is only able to predict at times 
X = k ■ T with A; G N. Since predictions are only required for camera frames which are 
delivered at a fixed rate, this is no constraint in our usage scenario. 

In [39] the algorithm has been compared to the widely used Kalman filter in terms of 
efficiency and accuracy. It showed that the prediction of human movements was possible 
with almost equal accuracy while outperforming the Kalman filter by factor 135. 



3.2.7 Multi-Touch Abstraction Layer 

Generally it is desirable to separate the multi touch sensing technology and the actual 
multi touch application using a simple abstraction layer. Although this introduces addi- 
tional latency to the processing pipeline it would allow the distribution of computation 
across a network or the usage of a different programming framework for application de- 
velopment. The latter is especially important as application developers from a design 
background often use specialized high-level frameworks such as Processing, Pure Data 
or Flash. The TUIO protocol is a network-based abstraction layer originating from the 
reacTIVision toolkit that has been widely adopted in the multi touch research and open 
source community since its initial publication in 2005 |35j . 



TUIO The TUIO protocol has been developed to provide a simple way of distribut- 
ing touch states from the reac Table without introducing significant additional latency. 
Although it was tailored to suit the needs of the original project it got widely adopted 
because it supported the basic primitives, i.e. touch points (called cursors) and mark- 
ers (called objects. The specification of markers will not be included here, see [3S] for 
a detailed description). In order to achieve a low-latency transmission the following 
characteristics are notable: 

• TUIO messages are based on Open Sound Control (OSC) a syntax widely used to 
encode control messages for musical instruments [35]. OSC messages are defined as 
a tuple containing the message profile, the command, the argument type signature 
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and the argument array. Several messages can be included in a bundle to reduce 
the number of transmitted packets. 

• TUIO messages/bundles are transmitted over the network using UDP. UDP pro- 
vides fast transmission at the lack of state information, i.e. the sender is unaware 
whether the packet actually arrived at the destination. 

• In order to ensure proper functioning using an error-prone transmission channel, 
TUIO communication is based on touch states instead of touch events. 

The TUIO update message for cursors is defined as 

/tuio/2Dcur set sid xpos ypos xvel yvel maccel 

containing as arguments the session id, position, velocity and acceleration. The profile 
also exists for the 2.5D (containing the height above the surface) and 3D (containing 3D 
position and rotation information) cases, see [SB] for a detailed description. 
The TUIO protocol furthermore includes the alive states for cursors: 

/tuio/2Dcur alive [active session IDs] 

The session IDs of all active cursors must be included in this messages during their 
whole lifetime regardless of whether their state has been previously updated using the 
aforementioned messages. 

In case messages do not arrive in orderly fashion at the destination, messages can be 
reordered or dropped using the additional frame sequence id: 

/tuio/2Dcur fseq [int32] 

Custom TUIO profiles Although the TUIO protocol provides profiles to efficiently 
represent single touch points it lacks support for hands. However the protocol allows 
the inclusion of custom commands that will be considered if supported by the receiving 
application or simply ignored otherwise. Hence the following profiles are to be considered 
complementary to the above described standard set of messages. 
The custom hand profile defines two messages: 

/custom/_hand set sid type posx posy width height [array of 2Dcur sids] 
[array of finger types in the same order as previous sids] 

/custom/_hand alive [active session IDs] 

The former updates the hand parameters, that is the hand type ("left", "right", "un- 
known"), its bounding box ((posx, posy) defines the center of a rectangle with edge 
lengths width and height respectively) and the touch points associated with this hand. 
Additionally for each touch point the finger registration is included (t - thumb, i - index 
finger, m - middle finger, r - ring finger, 1 - little finger, u - unknown). The latter message 
works similarly to the alive messages defined above. 
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3.3 Implementation 

During the work on the tabletop prototype an image processing framework has been de- 
veloped that implements the proposed approach to multi-touch processing. The frame- 
work has a number of notable characteristics: 

Performance Performance might seem like an obvious requirement for real-time appli- 
cations however usually poses a real challenge. At best its computational activity 
goes unnoticed to the user and multi-touch interaction is highly responsive. In 
the worst case it introduces latencies which interrupt user interaction at a level 
to making it unusable. Hence the framework was designed to make the most of 
todays multi-core CPUs and therefore relies highly on multi-threading wherever 
possible and appropriate. However threading also requires synchronization which 
in turn results in a computational overhead. In image processing frameworks such 
as OpenCV synchronization usually takes place after each image processing op- 
eration. However given the number of processing steps in the proposed pipeline 
this might add up to a significant overhead. Thus image processing operations are 
bundled into smaller processing pipehnes that will be executed without synchro- 
nization. 

Interoperability In order to being able to use this framework with other applications, 
also those that have not explicitly been developed for this framework, the TUIO 
protocol has been chosen as a means of communication. The TUIO protocol has 
established itself as a de facto standard in the open-source and most parts of the 
research community. The framework implements the TUIO vl.l specificatioE|^that 
is understood by most current applications hence allows the communication of fin- 
gertip positions. However the proposed extended commands which are transmitted 
alongside the common command set will only be interpreted by those applications 
having been developed for this framework. 

Extensibility The framework has been developed as a proof-of-concept of the proposed 
approach, hence the implemented image processing methods mainly reflect those 
described previously. However all image processing operations are based on a 
common base class that enables their easy integration in the processing pipeline. 
Thus, the processing pipeline can be extended to include any number of additional 
functionality without any modifications to the core components. 

Platform Independence Although this framework has been solely developed on a ma- 
chine running Ubuntu|^ Linux in version 10.04, platform independence is an im- 
portant criteria to ensure adoption of the framework by a wider public. Hence 
all accesses to hardware or operating system (OS) resources have been wrapped 
by dedicated libraries that ensure proper operation on a variety of hardware and 



operating system configurations. For instance, the boost libraries have been used 



"see 

g 

^see 
lOsee 



http://www.tuio . org/?specif ication 



www . ubuntu . com 



http: //www. boost . org 
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for access to the file system and threading capabilities of the host OS. Furthermore 
for accessing cameras based on the ieeel39^^ standard the dcl39^^ library has 
been employed, while the widely used computer vision library OpenC\ ^ has been 
used to implement some image processing features. 

Additional credits go to Daniel Fischer who developed a similar framework for the pre- 
vious prototype at the Virtual Reality Systems Group. The provided source code for 
camera frame acquisition, image rectification and a control interface based on OpenGL 
have been thankfully reused in this framework. 



3.3.1 Image Processing 

Image processing has been implemented similarly to other frameworks with one notable 
exception. Unlike other frameworks, where image operators usually process the entire 
image, all image processing functionality has been implemented to being able to op- 
erate on both, the entire image or just a smaller, predefined area of the image. This 
is especially useful with operations which apply special processing to border cases. For 



instance the connected components algorithm described on page [37] uses previously com- 
puted values to determine the value of the current pixel. This however requires pixels 
on the image border to be treated differently from pixels inside the image increasing 
the number of checks during operation. For all pixels contained in the first row and 
the first column of the image these previous values just do not exist as they lie outside 
the image area. However to streamline the algorithm implementation and to reduce the 
number of required checks, the algorithm processes an extended image which has a one 
pixel border initialized with a default value added around the actual image data. While 
traditional implementations would require copying image data to the extended image 
at each frame, results of previous operations can be written directly into the extended 
image using the capabilities of this framework. As the border pixels are only required 
for reading and are never written, these only need to be initialized once at startup. 



3.3.2 Multi-Threading 



As has been previously shown in figure 3.4 on page 33, the processing pipeline provides 
the possibility of concurrent execution during two stages: 



Preprocessing As all preprocessing steps starting from distortion correction until region 
of interest detection operate on a per-pixel basis, they can be considered a single 
combined operator when it comes to threading. Thus the image processing area is 
partitioned between the number of available threads and the whole preprocessing 
pipeline is executed entirely without synchronization. 



^^see jhttp : / /www . 1394ta . org/| and ,http : / /deunien . douxcheunps . net/ieeel394/libdcl394/iidc/ 

for the actual specifications 



12 



see http : //damien . douxchsmips .net/ ieeel394/libdcl394/ 
^^see ^http : / / opencv . willowgarage . com/^ 



61 



3 Multi-Touch System 



Region of Interest Processing After the detection of regions of interest in the image 

the threading strategy changes. Regions of interest are each processed in a single 
thread while the processing of several regions of interest is distributed among all 
available threads in decreasing region size. Hence the impact of threading depends 
on the number of regions of interest and their size. 

The framework provides specialized classes that allow the easy implementation of this 
functionality. By chaining a number of image operations in ProcessSets a processing 
pipeline can be defined that runs on a single thread. However ProcessSets can be grouped 
again and be marked for parallel execution enabling the execution of completely different 
pipelines on different threads. In order to provide simple means for synchronization 
execution barriers can be defined until either all or a predefined ProcessSet has finished. 
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In this chapter the previously presented processing pipehne is being tested in terms 
of performance and detection accuracy. Performance is obviously a crucial aspect of a 
real-time application as high latencies will be immediately felt by the user and impact 
usability. However detection accuracy might be considered even more important due 
to the high disruptive potential that erroneously identified touches might have on user 
interaction. 

Given the importance of these two factors, the presented evaluation aims to test the 
limits of the proposed processing pipeline by including as many simultaneous users as 
possible in the interaction. As to the maximum number of people that might comfort- 
ably perform simultaneous two-handed interaction on the multi-touch prototype without 
getting too much in their way, four people each of them standing on one side of the table 
has been found to work best. Hence, four students (all male, aged 24 - 27) from the 
Virtual Reahty Systems Group at the Bauhaus University Weimar were asked to take 
part in the evaluation, resulting in a maximum number of 40 simultaneous finger touches 
(and possibly even more evaluated touch points considering that users were allowed to 
rest their palm on the surface). 

Since no multi-touch capable user interface had been developed at this point, the 
participants were asked to mimic multi-touch gestures on the blank surface. However 
they were reminded to not only use common gestures involving a small number of si- 
multaneous touches but to also use both hands with randomly changing configurations 
of fingers touching the surface. Due to the lack of a real application participants tended 
to perform rather swift and abrupt movements though. While that is considered here 
as a challenge for the processing pipeline it obviously docs not reflect real-world usage. 
We assume that given a calmer user interaction the measured accuracy might be even 
better. 

In the end more than 4000 camera frames were captured during the interaction process 
that now could be replayed to the processing pipeline in order to properly measure 
execution times and detection accuracy. 

4.1 Performance 

As performance is dependent on the number of simultaneous user actions performed on 
the surface, projects like the Microsoft Surface define a maximum threshold on simultane- 
ous touches to avoid exceeding real-time performance. However as surfaces continuously 
grow larger and larger in size, processing pipelines have to deal with an ever increas- 
ing number of simultaneous touches. Therefore evaluating the detection performance 
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for up to 40 simultaneous fingers touching the surface is being considered a well-suited 
performance indicator of the processing pipeline. 

Measurements have been taken on a machine featuring a Intel Core 17 CPU940 
(2.93GHz, 4 cores, 8 logical processors using Hyper-Threading and 5.8GB of mem- 
ory running Ubuntu Linux in version 10.04. The camera delivered frames at a rate of 
60 fps at a resolution of 640 x 480. The measured execution time includes all steps from 
the pipeline apart from image acquisition. Measuring was started just after the image 
had been loaded into memory and stopped when all processing and memory clean-up 
steps had been executed. In order to minimize the influence of external factors such as 
the operating system, measurements have been averaged over 5 processing runs of the 
captured user interaction. 

Given the measurements shown in figure |4.1| the most important result is that even 
for single threaded execution a processing performance of at least 60 frames per second 
is achieved. Most notably that performance was achieved even during the simultaneous 
interaction of 4 users, 8 hands and up to 40 fingers. Furthermore using four processing 
threads processing time is capped at around 10ms regardless of how many fingers of the 
8 hands simultaneously touch the surface. 

From the measurements it becomes obvious that performance is highly dependent on 
the way users interact with the surface. While 4 users each touching the surface with 
one finger would benefit from threading in the processing stage, one user placing four 
fingers of the same hand onto the tabletop would not be able to benefit from threading. 
Furthermore processing performance is determined by the largest and most complex (in 
terms of extremal regions) region of interest. Although the processing pipeline assigns 
regions of interest to threads in decreasing size, a region of interest that deviates too 
much in terms of size and complexity could in the end negatively impact performance 
as other threads would be left in idle state waiting for that region to be processed. 

In the interaction process on which this evaluation is based, participants were asked 
at the beginning to place their hands on the surface one after another. At a later time 
of the interaction process rarely less than 10 simultaneous touches occurred and usually 
all 8 hands remained above the surface, hence measurements for less than 10 fingertips 
are largely based on frames where only one or two regions of interest were visible. Thus 
the impact of threading is limited in this case. Hence the performance progression with 
respect to fingertips should only be seen as an indicator of how performance could evolve. 

Albeit the performance measurements being very satisfying it is only a part of the 
processing that is performed between a user action and the corresponding visual feed- 
back. However it is that latency that in the end determines the users impression of 
responsiveness of a prototype. The contributing factors to the latency are as follows: 

• Since internal camera operation time is negligible the major amount of latency 
results from the exposure time as the camera image is not created at a single 
instant in time but rather results from the integration of incoming light over a 
period of time. 

^see http : //www . intel . com/content/www/us/en/architecture-axid-technology/ 

hyper-threading/hyper-threading-technology .html| 
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Evaluated Fingertips 

Figure 4.1: Execution times with respect to the simultaneous number of finger touches 
for different numbers of threads involved in the processing. 



• Next the transmission of image data from the camera to the computer adds further 
latency. The camera used in the prototype transfers data at 400Mb/s. However 
IEEE 1394 devices send data packets at a fixed cycle rate hence it is actually the 
number of packets that determines the transmission time. According to the IIDC 
specificatiorj^ the image mode used in this prototype (resolution: 640x480, pixel 
bit depth: 8, frame rate: 60) corresponds to a packet size of 2560 pixels. Thus 
640 * 480/2560 = 120 packets need to be send in order to transmit a complete 
camera frame. The camera has a cycle rate of 125/is hence transmission time adds 
another 120 * 125/is = 15ms to the latency. 

• The processing pipeline is the next important contributor to latency as can be 
seen from the measurements. Furthermore latency can temporarily vary due to 
unfavorable process scheduling from the operating system. 

• The TUIO protocol used to communicate detection results adds further latency 
due to the required network transport. Although it relies on UDP to transfer 
data, which unlike TCP operates connection-less and therefore is faster, it still 
adds some milliseconds to the latency. 

^IIDC (Instrumentation and Industrial Control Working Group) is the working group responsible for 
the specification. The specification is sometimes also known as Digital Camera (DCAM) specifica- 
tion. See http://daiiiien.douxchEmips.net/ieeel394/libdcl394/iidc/. 
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• Finally the multi-touch application has received user input from the touchscreen 
although it still needs to compute a response to the input and update the user 
interface accordingly. Although this part is at the responsibility of the application 
developer it still needs to be taken into account that a major chunk of processing 
is still to follow. 

Hence the latency introduced by the multi-touch sensing technology of the prototype 
can be summed up as follows: 

• Camera Image Acquisition: 16ms 

• Camera Image Transfer: 15ms 

• Processing Pipehne: 10ms (with 4 threads) 

The impact of the exposure time could be further reduced by pulsing the infrared light. 
Since the light's intensity would be much higher during the pulse the exposure time could 
be significantly shorter. While this would furthermore reduce the influence of ambient 
illumination, it would require additional logic to synchronize the illuminators and the 
camera. Moreover the camera transfer rate could be easily halved by using an IEEE 
1394 compliant camera that supports the faster data transfer rate of 800Mb/s. Given 
these possibihties to further shorten the sensing latency with additional hardware, the 
processing pipeline seems to be well suited performance- wise for applications in real-time 
scenarios. 

4.2 Accuracy 

The evaluation of accuracy is especially important due to the high disruptive potential 
that erroneous sensing results might have on usability. On the one hand wrongly iden- 
tified touches might lead to unintended application behavior leaving the user puzzled 
about what he has done wrong. On the other hand missing an intentional touch can lead 
to increased user frustration as the user might need to retry several times to perform an 
action. 

However measuring accuracy is particularly difficult as one first needs to establish a 
ground truth that later serves as a basis for comparison. In this case the first 1500 
frames from the interaction process have been considered. Given the extensive manual 
work required to label all visible fingertips in these camera images, this number has been 
further reduced to only consider every fifth frame. Hence a total of 300 camera images 
have been analyzed that contained on average more than 21 visible fingertips. 

In order to quantify the accuracy two basic measures will be used here. First will 
be considered how many of the visible fingertips have been correctly identified by the 
pipeline. This measure is usually called the hit rate. Secondly, given the total number of 
identified fingertips, how many of these have been wrongly considered to be in contact 
with the surface. This second measure is usually referred to as false positive rate. The 
measured accuracy is as follows: 
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Hit Rate 

Correctly identified fingertips 6449 

— = 0.973 

Total number of visible fingertips 6628 

False Positive Rate 

Wrongly identified fingertips 88 



Total number of identified fingertips 6537 



0.0135 



While the hit rate is very satisfactory and shows the potential of the proposed pipeline it 
is the false positive rate that might still need improvement. Given the constraints from 
the prototype in particular the significantly uneven illumination, this value appears to 
be still reasonably low though. Hence these two measures serve as an indication of 
the accuracy of this prototype but would require a re-evaluation in the future since the 
current data is based on a somewhat artificial user interaction as no proper user interface 
was present at the time. 

Moreover the accuracy of the hand distinction has been analyzed as well. For this 
purpose more than 450 camera images have been manually classified which have been 
extracted from the interaction process at five frame intervals. The evaluation measured 
the precision of the hand distinction process, that is the percentage of hands that have 
been clustered correctly. Hand distinction was regarded as successful if all detected 
fingertips from the same hand were attributed to the same cluster. However if these 
fingertips were contained in more than one cluster the clustering for this hand was 
considered invalid. Furthermore if even two hands were erroneously contained in the 
same cluster both were regarded as unsuccessful. The measured accuracy is as follows: 

Precision 

Correctly clustered hands 2681 

= « 0.922 

Total number of visible hands 2909 

The hand distinction usually failed in the presence of false positives from the fingertip 
detection. Since cluster size was capped at five fingers, false positives resulted in sur- 
passing that limit and hence interrupted the merging of clusters. Thus reducing the 
false positive rate of the fingertip detection step would have positive effects on hand 
distinction in terms of precision. 

The accuracy of the hand and fingertip classification has only informally been tested. 
In total eight students, all male aged in their mid-twenties, from the Virtual Reality 
Systems Group at the Bauhaus University Weimar had been asked to place both of their 
hands on the surface. While no restrictions were imposed on the exact hand posture, the 
students were asked to adopt a position that felt natural to them. In these short tests 
the hand and fingertips have been registered correctly for all participants. Hence the 
assumptions used to infer these properties can be considered justified. Although only a 
static hand position was tested here, it currently seems to be the most convincing use 
case where a gesture is started in this position allowing a proper registration. 
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In this thesis a novel processing pipeline for optical multi-touch surfaces has been pre- 
sented motivated by the perceived lack of viable processing alternatives to enable real 
multi-hand and multi-user interaction. While the common approach usually relies on 
thresholding to discern surface contacts, this novel approach is centered around the con- 
cept of extremal regions which in contrast only describe a relative relationship between 
an image region and its border. Depending solely on this relative relationship these 
regions are much more robust in the presence of non-uniform illumination. Furthermore 
extremal regions can be organized in a hierarchical structure hereby revealing the spatial 
relationship between surface contacts. In order to efficiently compute these structures 
the processing pipeline relies on the Maximally Stable Extremal Regions algorithm. Al- 
though it has been successfully used in stereo image matching and visual tracking, this 
algorithm has never before been employed for multi-touch sensing. Given that hier- 
archical structure of extremal regions, a method has been described to reveal all those 
regions that correspond to a touch by a fingertip based on a set of image features. Based 
on these fingertips a second algorithm enables the attribution of fingertips to different 
hands based on a approach combining the hierarchical structure as well as agglomer- 
ative hierarchical clustering. Finally a method has been presented that provides hand 
and fingertip registration for these clusters of fingertips. 

Subsequently the novel processing pipeline has been evaluated measuring both the 

performance as well as the detection accuracy. In total the processing of over 4000 
frames has been taken into account to measure the pipeline's performance. Furthermore 
over 300 frames have been manually analyzed and labeled, hereby establishing a ground 
truth that served as a basis for comparison in the accuracy evaluation. The performance 
evaluation has shown that the required processing achieves real-time performance even 
with multiple user simultaneously interacting with both their hands. Moreover the 
accuracy evaluation provided very satisfying results bearing in mind that this is only a 
first prototype of the novel processing pipeline. 



5.1 Future Work 

This section presents a short outlook on potential directions for further development. 
These include work on the prototype itself as well as applications that make use of the 
newly introduced features. 
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Figure 5.1: Amoeba fiducial and its topological structure [H] 



5.1.1 Machine Learning supported Fingertip Detection 

Although the accuracy results from the evaluation are very promising, the fingertip 
detection procedure might require further development. Considering both the hit rate 
and false positive rate it is evident that the procedure reveals fingertips with a very 
high confidence. However given the false positive rate of 1.35% there is still room 
for improvement. As the detection is already based on a numerical feature vector, 
one possible solution might be to use machine learning techniques like support vector 
machines for classification of extremal regions as fingertip or not. Since most of these 
algorithms however require a training data set of labeled examples of true and false 
fingertips, this comes at the cost of significantly higher manual work. 



5.1.2 Marker Support 

While the user's hands are usually the main input modality on tabletop displays they 
also lead to the user interface being the sole way of conveying information and func- 
tionality. However in certain situations tangible input devices that can be placed upon 
the tabletop might make functionality easier to grasp and hence might be considered a 
welcome extension of tabletop interfaces. The reacTable is a widely known prototype 
that intuitively combines both tangible objects as well as traditional multi-touch as user 
interaction techniques. Quoting the reacTable here is no coincidence as their marker 
engine, the Amoeba Engine, bears certain similarities to the processing used in this 
pipeline. Amoeba fiducials are characterized by the topological structure of black and 



white regions in the marker as shown in figure 5.1 Enforcing additional constraints on 
the tree structure makes marker detection in the Amoeba engine very robust. Hence 
one could imagine designing a similar marker engine based on the component tree that 
is already computed in our processing pipeline. As the component tree only includes 
regions that are brighter than their surrounding their fiducials would obviously not be 
compatible with such an approach. However since amoeba fiducials only require a tree 
depth of 2, three gray levels (preferably white, gray, black to achieve maximum contrast) 
would be sufficient to build a comparable tree structure. Though it remains to be shown 
whether such a structure could achieve a similar robustness without the use of additional 
features. 
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Figure 5.2: Design mock-up of a multi-touch application taking advantage of the addi- 
tional interaction features. 



5.1.3 User Interaction Studies 

The novel processing pipeline and algorithms have been shown to reliably detect multi- 
touch user input while revealing further input properties such as the hand used for 
interaction. However all this is only a means to an end, that is enabling novel user 
interaction techniques. While the pipeline has been evaluated on a technical level, the 
important next step would be to make use of these features in multi-touch applications 
and study whether these properties can make interaction on tabletop displays a more 
seamless and intuitive experience. 



Figure 5^ illustrates a possible application scenario that takes advantage of the ability 
to distinguish left and right hands. One could imagine that placing the non-dominant 
hand on the tabletop triggers the display of a menu where menu items are aligned with 
the fingers. Tapping a finger would either execute the corresponding action or display a 
submenu that again is aligned with the current finger positions. Hence the non-dominant 
hand is used for navigational tasks and mode-switching while the dominant hand is used 
for tasks that require particular precision. Furthermore that scenario would suit both 
novice and expert users. Novice users would benefit from the visual feedback provided 
by the menu display while expert users would only need to execute a tap sequence. 
Moreover as the non-dominant hand could be placed anywhere on the table, expert 
users could perform tasks without even looking at their hand during interaction. 
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